In our final group assignment, we will analyse data of Airbnb listings and fit a model to predict the total cost for two people staying 4 nights in an AirBnB in Berlin. We will download AirBnB data from insideairbnb.com; it was originally scraped from airbnb.com.
vroom will download the *.gz zipped file, unzip, and provide us with a dataframe.
There are many variables in the dataframe. Here is a quick description of some of the variables collected, and you can find a data dictionary here
price = cost per night
property_type: type of accommodation (House, Apartment, etc.)
room_type:
number_of_reviews: Total number of reviews for the listing
review_scores_rating: Average review score (0 - 100)
longitude , latitude: geographical coordinates to help us locate the listing
neighbourhood*: three variables on a few major neighbourhoods in each city
We start by looking at the raw values.
listings %>%
glimpse()Rows: 18,288
Columns: 74
$ id <dbl> 2015, 3176, 7071, 9991, 1~
$ listing_url <chr> "https://www.airbnb.com/r~
$ scrape_id <dbl> 2.021092e+13, 2.021092e+1~
$ last_scraped <date> 2021-09-22, 2021-09-22, ~
$ name <chr> "Berlin-Mitte Value! Quie~
$ description <chr> "Great location! <br />3~
$ neighborhood_overview <chr> "It is located in the for~
$ picture_url <chr> "https://a0.muscache.com/~
$ host_id <dbl> 2217, 3718, 17391, 33852,~
$ host_url <chr> "https://www.airbnb.com/u~
$ host_name <chr> "Ion", "Britta", "BrightR~
$ host_since <date> 2008-08-18, 2008-10-19, ~
$ host_location <chr> "Key Biscayne, Florida, U~
$ host_about <chr> "Isn’t sharing economy gr~
$ host_response_time <chr> "within an hour", "a few ~
$ host_response_rate <chr> "100%", "40%", "100%", "N~
$ host_acceptance_rate <chr> "91%", "100%", "N/A", "0%~
$ host_is_superhost <lgl> TRUE, FALSE, TRUE, FALSE,~
$ host_thumbnail_url <chr> "https://a0.muscache.com/~
$ host_picture_url <chr> "https://a0.muscache.com/~
$ host_neighbourhood <chr> "Mitte", "Prenzlauer Berg~
$ host_listings_count <dbl> 5, 1, 2, 1, 4, 4, 2, 1, 4~
$ host_total_listings_count <dbl> 5, 1, 2, 1, 4, 4, 2, 1, 4~
$ host_verifications <chr> "['email', 'phone', 'revi~
$ host_has_profile_pic <lgl> TRUE, TRUE, TRUE, TRUE, T~
$ host_identity_verified <lgl> FALSE, TRUE, TRUE, TRUE, ~
$ neighbourhood <chr> "Berlin, Germany", "Berli~
$ neighbourhood_cleansed <chr> "Brunnenstr. Süd", "Prenz~
$ neighbourhood_group_cleansed <chr> "Mitte", "Pankow", "Panko~
$ latitude <dbl> 52.53305, 52.53471, 52.54~
$ longitude <dbl> 13.40394, 13.41810, 13.41~
$ property_type <chr> "Entire guesthouse", "Ent~
$ room_type <chr> "Entire home/apt", "Entir~
$ accommodates <dbl> 2, 4, 2, 7, 1, 5, 2, 4, 4~
$ bathrooms <lgl> NA, NA, NA, NA, NA, NA, N~
$ bathrooms_text <chr> "1 bath", "1 bath", "1 sh~
$ bedrooms <dbl> 1, 1, 1, 4, NA, 1, NA, 2,~
$ beds <dbl> 0, 2, 2, 7, 1, 3, 0, 2, 2~
$ amenities <chr> "[\"Refrigerator\", \"Hea~
$ price <chr> "$77.00", "$90.00", "$33.~
$ minimum_nights <dbl> 90, 62, 1, 6, 90, 60, 5, ~
$ maximum_nights <dbl> 1125, 1125, 10, 14, 1125,~
$ minimum_minimum_nights <dbl> 33, 62, 1, 6, 90, 60, 5, ~
$ maximum_minimum_nights <dbl> 90, 62, 1, 6, 90, 60, 5, ~
$ minimum_maximum_nights <dbl> 1125, 1125, 10, 14, 1125,~
$ maximum_maximum_nights <dbl> 1125, 1125, 10, 14, 1125,~
$ minimum_nights_avg_ntm <dbl> 88.2, 62.0, 1.0, 6.0, 90.~
$ maximum_nights_avg_ntm <dbl> 1125.0, 1125.0, 10.0, 14.~
$ calendar_updated <lgl> NA, NA, NA, NA, NA, NA, N~
$ has_availability <lgl> TRUE, TRUE, TRUE, TRUE, T~
$ availability_30 <dbl> 0, 9, 0, 0, 0, 0, 0, 3, 0~
$ availability_60 <dbl> 21, 9, 0, 0, 1, 0, 4, 31,~
$ availability_90 <dbl> 51, 9, 0, 0, 31, 0, 4, 61~
$ availability_365 <dbl> 326, 93, 0, 0, 102, 144, ~
$ calendar_last_scraped <date> 2021-09-22, 2021-09-22, ~
$ number_of_reviews <dbl> 143, 147, 293, 8, 26, 48,~
$ number_of_reviews_ltm <dbl> 10, 1, 0, 0, 1, 0, 21, 2,~
$ number_of_reviews_l30d <dbl> 1, 0, 0, 0, 0, 0, 3, 0, 0~
$ first_review <date> 2016-04-11, 2010-12-21, ~
$ last_review <date> 2021-07-22, 2017-03-20, ~
$ review_scores_rating <dbl> 4.66, 4.63, 4.83, 5.00, 4~
$ review_scores_accuracy <dbl> 4.79, 4.68, 4.85, 5.00, 5~
$ review_scores_cleanliness <dbl> 4.52, 4.53, 4.90, 5.00, 4~
$ review_scores_checkin <dbl> 4.88, 4.64, 4.86, 5.00, 4~
$ review_scores_communication <dbl> 4.89, 4.69, 4.85, 5.00, 4~
$ review_scores_location <dbl> 4.96, 4.92, 4.91, 4.86, 4~
$ review_scores_value <dbl> 4.59, 4.63, 4.71, 4.86, 4~
$ license <chr> NA, NA, NA, "03/Z/RA/0034~
$ instant_bookable <lgl> FALSE, FALSE, TRUE, FALSE~
$ calculated_host_listings_count <dbl> 5, 1, 1, 1, 3, 2, 1, 1, 2~
$ calculated_host_listings_count_entire_homes <dbl> 5, 1, 0, 1, 3, 2, 1, 1, 2~
$ calculated_host_listings_count_private_rooms <dbl> 0, 0, 1, 0, 0, 0, 0, 0, 0~
$ calculated_host_listings_count_shared_rooms <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0~
$ reviews_per_month <dbl> 2.15, 1.12, 2.40, 0.16, 0~
Our dataset has 18,288 observations and 74 variables. It includes details about Airbnb listings in Berlin.
listings %>%
skim()| Name | Piped data |
| Number of rows | 18288 |
| Number of columns | 74 |
| _______________________ | |
| Column type frequency: | |
| character | 25 |
| Date | 5 |
| logical | 7 |
| numeric | 37 |
| ________________________ | |
| Group variables | None |
Variable type: character
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| listing_url | 0 | 1.00 | 33 | 37 | 0 | 18288 | 0 |
| name | 29 | 1.00 | 1 | 255 | 0 | 17766 | 0 |
| description | 544 | 0.97 | 1 | 1000 | 0 | 17156 | 0 |
| neighborhood_overview | 8702 | 0.52 | 1 | 1000 | 0 | 8570 | 0 |
| picture_url | 0 | 1.00 | 60 | 126 | 0 | 18047 | 0 |
| host_url | 0 | 1.00 | 38 | 43 | 0 | 14776 | 0 |
| host_name | 16 | 1.00 | 1 | 35 | 0 | 5177 | 0 |
| host_location | 59 | 1.00 | 1 | 199 | 0 | 952 | 0 |
| host_about | 9327 | 0.49 | 1 | 5095 | 0 | 6642 | 21 |
| host_response_time | 16 | 1.00 | 3 | 18 | 0 | 5 | 0 |
| host_response_rate | 16 | 1.00 | 2 | 4 | 0 | 66 | 0 |
| host_acceptance_rate | 16 | 1.00 | 2 | 4 | 0 | 99 | 0 |
| host_thumbnail_url | 16 | 1.00 | 55 | 106 | 0 | 14674 | 0 |
| host_picture_url | 16 | 1.00 | 57 | 109 | 0 | 14674 | 0 |
| host_neighbourhood | 6091 | 0.67 | 1 | 28 | 0 | 165 | 0 |
| host_verifications | 0 | 1.00 | 2 | 158 | 0 | 318 | 0 |
| neighbourhood | 8702 | 0.52 | 7 | 43 | 0 | 50 | 0 |
| neighbourhood_cleansed | 0 | 1.00 | 4 | 41 | 0 | 137 | 0 |
| neighbourhood_group_cleansed | 0 | 1.00 | 5 | 24 | 0 | 12 | 0 |
| property_type | 0 | 1.00 | 3 | 35 | 0 | 68 | 0 |
| room_type | 0 | 1.00 | 10 | 15 | 0 | 4 | 0 |
| bathrooms_text | 26 | 1.00 | 6 | 17 | 0 | 27 | 0 |
| amenities | 0 | 1.00 | 2 | 1416 | 0 | 15257 | 0 |
| price | 0 | 1.00 | 5 | 9 | 0 | 430 | 0 |
| license | 16019 | 0.12 | 3 | 342 | 0 | 1921 | 0 |
Variable type: Date
| skim_variable | n_missing | complete_rate | min | max | median | n_unique |
|---|---|---|---|---|---|---|
| last_scraped | 0 | 1.0 | 2021-09-21 | 2021-10-03 | 2021-09-22 | 4 |
| host_since | 16 | 1.0 | 2008-08-08 | 2021-09-20 | 2015-09-16 | 3562 |
| calendar_last_scraped | 0 | 1.0 | 2021-09-21 | 2021-10-03 | 2021-09-22 | 4 |
| first_review | 3572 | 0.8 | 2010-12-21 | 2021-09-22 | 2018-07-10 | 2771 |
| last_review | 3572 | 0.8 | 2012-07-08 | 2021-09-26 | 2019-09-28 | 2226 |
Variable type: logical
| skim_variable | n_missing | complete_rate | mean | count |
|---|---|---|---|---|
| host_is_superhost | 16 | 1 | 0.15 | FAL: 15448, TRU: 2824 |
| host_has_profile_pic | 16 | 1 | 0.99 | TRU: 18178, FAL: 94 |
| host_identity_verified | 16 | 1 | 0.79 | TRU: 14412, FAL: 3860 |
| bathrooms | 18288 | 0 | NaN | : |
| calendar_updated | 18288 | 0 | NaN | : |
| has_availability | 0 | 1 | 0.98 | TRU: 17929, FAL: 359 |
| instant_bookable | 0 | 1 | 0.30 | FAL: 12737, TRU: 5551 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| id | 0 | 1.00 | 2.557156e+07 | 15400105.68 | 2.015000e+03 | 1.218794e+07 | 2.385470e+07 | 3.968697e+07 | 5.238006e+07 | <U+2587><U+2587><U+2587><U+2586><U+2587> |
| scrape_id | 0 | 1.00 | 2.021092e+13 | 0.00 | 2.021092e+13 | 2.021092e+13 | 2.021092e+13 | 2.021092e+13 | 2.021092e+13 | <U+2581><U+2581><U+2587><U+2581><U+2581> |
| host_id | 0 | 1.00 | 9.337946e+07 | 108308811.44 | 1.581000e+03 | 1.194556e+07 | 4.352120e+07 | 1.449065e+08 | 4.238179e+08 | <U+2587><U+2582><U+2581><U+2581><U+2581> |
| host_listings_count | 16 | 1.00 | 4.560000e+00 | 40.36 | 0.000000e+00 | 1.000000e+00 | 1.000000e+00 | 2.000000e+00 | 2.010000e+03 | <U+2587><U+2581><U+2581><U+2581><U+2581> |
| host_total_listings_count | 16 | 1.00 | 4.560000e+00 | 40.36 | 0.000000e+00 | 1.000000e+00 | 1.000000e+00 | 2.000000e+00 | 2.010000e+03 | <U+2587><U+2581><U+2581><U+2581><U+2581> |
| latitude | 0 | 1.00 | 5.251000e+01 | 0.03 | 5.234000e+01 | 5.249000e+01 | 5.251000e+01 | 5.253000e+01 | 5.266000e+01 | <U+2581><U+2581><U+2587><U+2583><U+2581> |
| longitude | 0 | 1.00 | 1.341000e+01 | 0.06 | 1.310000e+01 | 1.337000e+01 | 1.341000e+01 | 1.344000e+01 | 1.376000e+01 | <U+2581><U+2582><U+2587><U+2581><U+2581> |
| accommodates | 0 | 1.00 | 2.710000e+00 | 1.62 | 0.000000e+00 | 2.000000e+00 | 2.000000e+00 | 3.000000e+00 | 1.600000e+01 | <U+2587><U+2582><U+2581><U+2581><U+2581> |
| bedrooms | 1609 | 0.91 | 1.270000e+00 | 0.63 | 1.000000e+00 | 1.000000e+00 | 1.000000e+00 | 1.000000e+00 | 1.200000e+01 | <U+2587><U+2581><U+2581><U+2581><U+2581> |
| beds | 227 | 0.99 | 1.620000e+00 | 1.24 | 0.000000e+00 | 1.000000e+00 | 1.000000e+00 | 2.000000e+00 | 1.700000e+01 | <U+2587><U+2581><U+2581><U+2581><U+2581> |
| minimum_nights | 0 | 1.00 | 9.320000e+00 | 34.24 | 1.000000e+00 | 2.000000e+00 | 3.000000e+00 | 5.000000e+00 | 1.124000e+03 | <U+2587><U+2581><U+2581><U+2581><U+2581> |
| maximum_nights | 0 | 1.00 | 5.883000e+02 | 528.30 | 1.000000e+00 | 2.800000e+01 | 3.650000e+02 | 1.125000e+03 | 5.000000e+03 | <U+2587><U+2587><U+2581><U+2581><U+2581> |
| minimum_minimum_nights | 1 | 1.00 | 9.220000e+00 | 34.15 | 1.000000e+00 | 2.000000e+00 | 3.000000e+00 | 5.000000e+00 | 1.124000e+03 | <U+2587><U+2581><U+2581><U+2581><U+2581> |
| maximum_minimum_nights | 1 | 1.00 | 9.880000e+00 | 35.35 | 1.000000e+00 | 2.000000e+00 | 3.000000e+00 | 5.000000e+00 | 1.124000e+03 | <U+2587><U+2581><U+2581><U+2581><U+2581> |
| minimum_maximum_nights | 1 | 1.00 | 4.704186e+05 | 31757979.28 | 1.000000e+00 | 3.000000e+01 | 1.125000e+03 | 1.125000e+03 | 2.147484e+09 | <U+2587><U+2581><U+2581><U+2581><U+2581> |
| maximum_maximum_nights | 1 | 1.00 | 5.878616e+05 | 35505529.04 | 1.000000e+00 | 3.000000e+01 | 1.125000e+03 | 1.125000e+03 | 2.147484e+09 | <U+2587><U+2581><U+2581><U+2581><U+2581> |
| minimum_nights_avg_ntm | 1 | 1.00 | 9.600000e+00 | 34.52 | 1.000000e+00 | 2.000000e+00 | 3.000000e+00 | 5.000000e+00 | 1.124000e+03 | <U+2587><U+2581><U+2581><U+2581><U+2581> |
| maximum_nights_avg_ntm | 1 | 1.00 | 5.875923e+05 | 35489477.79 | 1.000000e+00 | 3.000000e+01 | 1.125000e+03 | 1.125000e+03 | 2.147484e+09 | <U+2587><U+2581><U+2581><U+2581><U+2581> |
| availability_30 | 0 | 1.00 | 3.810000e+00 | 7.72 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 3.000000e+00 | 3.000000e+01 | <U+2587><U+2581><U+2581><U+2581><U+2581> |
| availability_60 | 0 | 1.00 | 1.019000e+01 | 17.76 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 1.500000e+01 | 6.000000e+01 | <U+2587><U+2581><U+2581><U+2581><U+2581> |
| availability_90 | 0 | 1.00 | 1.832000e+01 | 29.27 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 3.500000e+01 | 9.000000e+01 | <U+2587><U+2581><U+2581><U+2581><U+2581> |
| availability_365 | 0 | 1.00 | 8.556000e+01 | 124.51 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 1.620000e+02 | 3.650000e+02 | <U+2587><U+2581><U+2581><U+2581><U+2582> |
| number_of_reviews | 0 | 1.00 | 2.279000e+01 | 51.02 | 0.000000e+00 | 1.000000e+00 | 4.000000e+00 | 1.700000e+01 | 6.550000e+02 | <U+2587><U+2581><U+2581><U+2581><U+2581> |
| number_of_reviews_ltm | 0 | 1.00 | 2.680000e+00 | 9.36 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 2.000000e+00 | 4.470000e+02 | <U+2587><U+2581><U+2581><U+2581><U+2581> |
| number_of_reviews_l30d | 0 | 1.00 | 4.500000e-01 | 1.65 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 1.040000e+02 | <U+2587><U+2581><U+2581><U+2581><U+2581> |
| review_scores_rating | 3572 | 0.80 | 4.630000e+00 | 0.80 | 0.000000e+00 | 4.610000e+00 | 4.850000e+00 | 5.000000e+00 | 5.000000e+00 | <U+2581><U+2581><U+2581><U+2581><U+2587> |
| review_scores_accuracy | 3897 | 0.79 | 4.790000e+00 | 0.41 | 0.000000e+00 | 4.750000e+00 | 4.920000e+00 | 5.000000e+00 | 5.000000e+00 | <U+2581><U+2581><U+2581><U+2581><U+2587> |
| review_scores_cleanliness | 3895 | 0.79 | 4.640000e+00 | 0.53 | 0.000000e+00 | 4.500000e+00 | 4.800000e+00 | 5.000000e+00 | 5.000000e+00 | <U+2581><U+2581><U+2581><U+2581><U+2587> |
| review_scores_checkin | 3909 | 0.79 | 4.830000e+00 | 0.39 | 0.000000e+00 | 4.800000e+00 | 4.960000e+00 | 5.000000e+00 | 5.000000e+00 | <U+2581><U+2581><U+2581><U+2581><U+2587> |
| review_scores_communication | 3898 | 0.79 | 4.830000e+00 | 0.40 | 0.000000e+00 | 4.810000e+00 | 4.970000e+00 | 5.000000e+00 | 5.000000e+00 | <U+2581><U+2581><U+2581><U+2581><U+2587> |
| review_scores_location | 3908 | 0.79 | 4.760000e+00 | 0.38 | 0.000000e+00 | 4.670000e+00 | 4.880000e+00 | 5.000000e+00 | 5.000000e+00 | <U+2581><U+2581><U+2581><U+2581><U+2587> |
| review_scores_value | 3910 | 0.79 | 4.670000e+00 | 0.45 | 0.000000e+00 | 4.550000e+00 | 4.760000e+00 | 5.000000e+00 | 5.000000e+00 | <U+2581><U+2581><U+2581><U+2581><U+2587> |
| calculated_host_listings_count | 0 | 1.00 | 3.030000e+00 | 7.45 | 1.000000e+00 | 1.000000e+00 | 1.000000e+00 | 2.000000e+00 | 7.600000e+01 | <U+2587><U+2581><U+2581><U+2581><U+2581> |
| calculated_host_listings_count_entire_homes | 0 | 1.00 | 1.940000e+00 | 5.42 | 0.000000e+00 | 0.000000e+00 | 1.000000e+00 | 1.000000e+00 | 4.400000e+01 | <U+2587><U+2581><U+2581><U+2581><U+2581> |
| calculated_host_listings_count_private_rooms | 0 | 1.00 | 8.900000e-01 | 3.25 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 1.000000e+00 | 4.500000e+01 | <U+2587><U+2581><U+2581><U+2581><U+2581> |
| calculated_host_listings_count_shared_rooms | 0 | 1.00 | 1.400000e-01 | 2.02 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 3.800000e+01 | <U+2587><U+2581><U+2581><U+2581><U+2581> |
| reviews_per_month | 3572 | 0.80 | 8.200000e-01 | 1.58 | 1.000000e-02 | 9.000000e-02 | 3.000000e-01 | 1.000000e+00 | 9.086000e+01 | <U+2587><U+2581><U+2581><U+2581><U+2581> |
Using the skim function, we can see that we have 25 character variables, 5 date variables, 7 logical variables and 37 numeric variables. While many variables have no missing values, some variables like ‘host_about’, ‘host_neighbourhood’, ‘neighbourhood’ and ‘license’ have many missing observations. In fact, ‘bathrooms’ and ‘calendar_update’ have no observations at all.
We will now list the variable names by type:
listings %>%
skim() %>%
filter(skim_type=="numeric") %>%
select(skim_variable)| skim_variable |
|---|
| id |
| scrape_id |
| host_id |
| host_listings_count |
| host_total_listings_count |
| latitude |
| longitude |
| accommodates |
| bedrooms |
| beds |
| minimum_nights |
| maximum_nights |
| minimum_minimum_nights |
| maximum_minimum_nights |
| minimum_maximum_nights |
| maximum_maximum_nights |
| minimum_nights_avg_ntm |
| maximum_nights_avg_ntm |
| availability_30 |
| availability_60 |
| availability_90 |
| availability_365 |
| number_of_reviews |
| number_of_reviews_ltm |
| number_of_reviews_l30d |
| review_scores_rating |
| review_scores_accuracy |
| review_scores_cleanliness |
| review_scores_checkin |
| review_scores_communication |
| review_scores_location |
| review_scores_value |
| calculated_host_listings_count |
| calculated_host_listings_count_entire_homes |
| calculated_host_listings_count_private_rooms |
| calculated_host_listings_count_shared_rooms |
| reviews_per_month |
listings %>%
skim() %>%
filter(skim_type=="logical") %>%
select(skim_variable)| skim_variable |
|---|
| host_is_superhost |
| host_has_profile_pic |
| host_identity_verified |
| bathrooms |
| calendar_updated |
| has_availability |
| instant_bookable |
listings %>%
skim() %>%
filter(skim_type=="character") %>%
select(skim_variable)| skim_variable |
|---|
| listing_url |
| name |
| description |
| neighborhood_overview |
| picture_url |
| host_url |
| host_name |
| host_location |
| host_about |
| host_response_time |
| host_response_rate |
| host_acceptance_rate |
| host_thumbnail_url |
| host_picture_url |
| host_neighbourhood |
| host_verifications |
| neighbourhood |
| neighbourhood_cleansed |
| neighbourhood_group_cleansed |
| property_type |
| room_type |
| bathrooms_text |
| amenities |
| price |
| license |
host_response_rate, host_acceptance_rate, and price have been categorized as character strings instead of numeric variables.
Since price is a quantitative variable, we need to make sure it is stored as numeric data num in the dataframe. To do so, we will first use readr::parse_number() which drops any non-numeric characters before or after the first number. We will also convert host_response_rate and host_acceptance_rate.
listings <- listings %>%
mutate(price = parse_number(price),
host_response_rate = parse_number(host_response_rate)/100,
host_acceptance_rate = parse_number(host_acceptance_rate)/100
)
# We check the new type of the three variables ...
typeof(listings$price)[1] "double"
typeof(listings$host_response_rate)[1] "double"
typeof(listings$host_acceptance_rate) #Price, host response rate and host acceptance rate are now stored as numbers for the training set[1] "double"
# ... and take a closer look at the variables
listings %>%
select(price, host_response_rate, host_acceptance_rate) %>%
glimpse()Rows: 18,288
Columns: 3
$ price <dbl> 77, 90, 33, 180, 70, 90, 49, 169, 70, 65, 93, 65,~
$ host_response_rate <dbl> 1.00, 0.40, 1.00, NA, 1.00, 1.00, 1.00, NA, 1.00,~
$ host_acceptance_rate <dbl> 0.91, 1.00, NA, 0.00, 0.56, NA, 0.98, NA, NA, 0.7~
Now we will look at the variables in more detail and construct our dependent variable of interest: price per 4 nights for 2 people.
Airbnb is most commonly used for travel purposes, i.e., as an alternative to traditional hotels. However, some properties might be intended for other purposes. We only want to include listings in our regression analysis that are intended for travel purposes. In order to do this, we need to look at minimum_nights in more detail:
# First, we want to see what are the most common values for minimum_nights
listings %>%
count(minimum_nights) %>%
arrange(desc(n))| minimum_nights | n |
|---|---|
| 2 | 4236 |
| 1 | 4194 |
| 3 | 3282 |
| 4 | 1368 |
| 5 | 1293 |
| 7 | 864 |
| 30 | 418 |
| 6 | 382 |
| 14 | 363 |
| 60 | 298 |
| 10 | 284 |
| 90 | 195 |
| 20 | 117 |
| 28 | 95 |
| 15 | 82 |
| 8 | 75 |
| 21 | 72 |
| 180 | 53 |
| 12 | 43 |
| 9 | 39 |
| 25 | 37 |
| 13 | 33 |
| 29 | 32 |
| 61 | 31 |
| 62 | 28 |
| 22 | 26 |
| 120 | 23 |
| 31 | 17 |
| 183 | 16 |
| 45 | 14 |
| 93 | 13 |
| 18 | 11 |
| 150 | 11 |
| 16 | 10 |
| 89 | 10 |
| 91 | 10 |
| 40 | 9 |
| 58 | 9 |
| 100 | 9 |
| 357 | 9 |
| 11 | 7 |
| 19 | 7 |
| 50 | 7 |
| 56 | 7 |
| 365 | 7 |
| 23 | 6 |
| 27 | 6 |
| 181 | 6 |
| 65 | 5 |
| 92 | 5 |
| 200 | 5 |
| 1e+03 | 5 |
| 17 | 4 |
| 55 | 4 |
| 63 | 4 |
| 70 | 4 |
| 80 | 4 |
| 85 | 4 |
| 300 | 4 |
| 24 | 3 |
| 42 | 3 |
| 59 | 3 |
| 99 | 3 |
| 118 | 3 |
| 182 | 3 |
| 186 | 3 |
| 500 | 3 |
| 1.12e+03 | 3 |
| 26 | 2 |
| 33 | 2 |
| 35 | 2 |
| 83 | 2 |
| 84 | 2 |
| 140 | 2 |
| 185 | 2 |
| 240 | 2 |
| 360 | 2 |
| 34 | 1 |
| 37 | 1 |
| 48 | 1 |
| 49 | 1 |
| 51 | 1 |
| 71 | 1 |
| 75 | 1 |
| 82 | 1 |
| 87 | 1 |
| 88 | 1 |
| 98 | 1 |
| 101 | 1 |
| 105 | 1 |
| 119 | 1 |
| 122 | 1 |
| 125 | 1 |
| 128 | 1 |
| 129 | 1 |
| 170 | 1 |
| 179 | 1 |
| 184 | 1 |
| 187 | 1 |
| 188 | 1 |
| 210 | 1 |
| 250 | 1 |
| 270 | 1 |
| 304 | 1 |
| 355 | 1 |
| 356 | 1 |
| 720 | 1 |
| 1.1e+03 | 1 |
# We also look at the range of values for minimum_nights
favstats(~minimum_nights, data = listings)| min | Q1 | median | Q3 | max | mean | sd | n | missing |
|---|---|---|---|---|---|---|---|---|
| 1 | 2 | 3 | 5 | 1.12e+03 | 9.32 | 34.2 | 18288 | 0 |
As we predicted, most properties on Airbnb seem to be meant for travelers. The most common values are 2,1 and 3 nights, which would be reasonable for a tourist. We can also see, however, that 30 and 60 day minimum stay are among the 10 most common type of properties. We can guess these properties are meant as a substitute to renting from an agent or a landlord. In other words, these properties are meant for long-term stay and are not appropriate for our analysis. Additionally, we can see that the median minimum stay requirement is 3 nights, where as the mean is around 9. The discrepancy comes from the large outliers we see in the data. For example, one property has a minimum stay of 1,124 nights, which heavily skews our data.
To deal with this, we will filter the data to include only properties that have a minimum stay of 4 nights or less.
listings <- listings %>%
filter(minimum_nights <= 4)
favstats(~minimum_nights, data = listings)| min | Q1 | median | Q3 | max | mean | sd | n | missing |
|---|---|---|---|---|---|---|---|---|
| 1 | 1 | 2 | 3 | 4 | 2.14 | 0.985 | 13080 | 0 |
listings %>%
ggplot(aes(x = minimum_nights)) +
geom_histogram(fill = "light blue",
colour = "grey") +
labs(title = "Distribution of Minimum Stay Requirements",
x = "Minimum Nights Stay Requirement",
y = "Count",
caption="Data from: http://insideairbnb.com/get-the-data.html") +
theme_bw()So far we have filtered the data to include only the listings, which offer a stay with a minimum length of 1-4 nights. However, it is possible that these include apartments which offer maximum stays of 1-3 days. We need to remove these listings and keeping only those that have maximum stays of 4 nights or more.
listings %>%
filter(maximum_nights<=3) %>%
count() | n |
|---|
| 256 |
listings <- listings %>%
filter(maximum_nights >= 4)
favstats(~maximum_nights, data = listings)| min | Q1 | median | Q3 | max | mean | sd | n | missing |
|---|---|---|---|---|---|---|---|---|
| 4 | 25 | 1.12e+03 | 1.12e+03 | 5e+03 | 621 | 532 | 12824 | 0 |
listings %>%
ggplot(aes(x = maximum_nights)) +
geom_histogram(fill = "light blue",
colour = "grey") +
labs(title = "Distribution of Maximum Stay Requirements",
x = "Maximum Nights Stay Requirement",
y = "Count",
caption="Data from: http://insideairbnb.com/get-the-data.html") +
theme_bw() We have now removed the 256 listings that had a stay requirement that was too short.
Since we are considering apartments for 2, we also need to filter the data to include only apartments which accommodate at least 2 people. First let’s look at summary statistics of the accommodates variable.
favstats(~accommodates, data=listings)| min | Q1 | median | Q3 | max | mean | sd | n | missing |
|---|---|---|---|---|---|---|---|---|
| 0 | 2 | 2 | 3 | 16 | 2.77 | 1.71 | 12824 | 0 |
There is also an apartment in the dataset which claims to accommodate 16 people. It’s possible that this is an outlier so we will plot a histogram and boxplot to determine whether this is the case.
listings %>%
ggplot(aes(x = accommodates)) +
geom_histogram(fill = "light blue",
color = "grey") +
labs(title = "Distribution of people accommodated per Airbnb Apartment in Berlin",
subtitle = "Data includes only apartments that accommodate at least 2 people",
x = "Number of people accommodated",
y = "Count",
caption="Data from: http://insideairbnb.com/get-the-data.html") +
theme_bw() The distribution has a positive skew.
listings %>%
ggplot(aes(x = accommodates)) +
geom_boxplot(fill="light blue") +
labs(title = "Distribution of people accommodated per Airbnb Apartment in Berlin",
x = "Number of people accommodated",
caption="Data from: http://insideairbnb.com/get-the-data.html") +
theme_bw() The boxplot shows that there are several outliers. Let’s check how many points are classified as outliers and remove some of them.
accommodates_outliers<-boxplot(listings$accommodates)$outlength(accommodates_outliers) #There are a fair number of 'outliers'. Let's look at these more closely.[1] 1318
accommodates_outliers %>% table() #There are a fairly large number of apartments which accommodate 8 people. .
0 5 6 7 8 9 10 11 12 13 14 15 16
6 419 501 93 129 34 52 11 28 10 13 5 17
#We will only discard apartments which accommodate 9 or more people.
listings<-listings %>%
mutate(accommodates = if_else(accommodates>=9,NA_real_,accommodates)
)Since we are interested only in apartments for 2 people, we keep only the listings that accommodate at least 2 people.
listings<-listings %>%
filter(accommodates>=2)Now that we have filtered the data, let’s look at our dependent variable price in more detail:
favstats(~price, data = listings)| min | Q1 | median | Q3 | max | mean | sd | n | missing |
|---|---|---|---|---|---|---|---|---|
| 0 | 39 | 60 | 90 | 8e+03 | 78.8 | 118 | 11251 | 0 |
We can see that none of the price observations are missing, however the minimum price per night is 0, which seems suspicious. We will investigate this further and see how many observations have a price of 0.
listings %>%
filter(price==0) %>%
count()| n |
|---|
| 2 |
Next we are going to create a new variable price_4_nights, which represents the price we would have to pay to stay at an Airbnb property for 4 nights. We also exclude any properties with negative or 0 prices.
listings <- listings %>%
filter(price > 0) %>%
mutate(price_4_nights = 4*price)It is important to know the distribution of our dependent variable. To check this, we will plot a histogram of price_4_nights with the code below:
listings %>%
ggplot(aes(x = price_4_nights)) +
geom_histogram(fill = "light blue", colour = "grey") +
labs(title = "Distribution of Prices of Airbnb Apartment in Berlin for 4 nights",
subtitle = "Data includes only apartments that accommodate at least 2 people",
x = "Price per 4 nights",
y="Count",
caption="Data from: http://insideairbnb.com/get-the-data.html") +
theme_bw() As we can see, the data on prices is heavily right-skewed. This can hurt the interpretative power of our model and we would need to address it before continuing with our regression analysis.
listings %>%
ggplot(aes(x = price_4_nights)) +
geom_boxplot(color="light blue") +
labs(title = "Distribution of Prices of Airbnb Apartment in Berlin for 4 nights",
subtitle = "Data includes only apartments that accommodate at least 2 people",
caption="Data from: http://insideairbnb.com/get-the-data.html",
x = "Price per 4 Nights") +
theme_bw()Additionally, we can see from the boxplot that there are some outliers. This will be addressed below.
To deal with the distribution of prices, we will transform the price_4_nights using a log function.
listings <- listings %>%
mutate(log_price_4_nights = log(price_4_nights))
listings %>%
ggplot(aes(x = log_price_4_nights)) +
geom_histogram(fill = "light blue",
colour = "grey") +
geom_vline(aes(xintercept = median(log_price_4_nights)),
color ="black",
linetype ="dashed",
size = 1)+
labs(title = "Distribution of Log Prices of Airbnb Apartment in Berlin for 4 nights",
subtitle = "Data includes only apartments that accomodate at least 2 people",
x = "Log of Price per 4 nights",
y = "",
caption = "Data from: http://insideairbnb.com/get-the-data.html") +
theme_bw()listings %>%
ggplot(aes(x = log_price_4_nights)) +
geom_density(fill = "light blue") +
geom_vline(aes(xintercept = median(log_price_4_nights)),
color = "black",
linetype="dashed",
size=1) +
labs(title = "Distribution of Log Prices of Airbnb Apartment in Berlin for 4 nights",
subtitle = "Data includes only apartments that accomodate at least 2 people",
x = "Log of Price per 4 nights",
y = "",
caption="Data from: http://insideairbnb.com/get-the-data.html") +
theme_bw() As we can see, applying a log transformation made the distribution of prices much more like a normal distribution, although there is still a slight right-skew.
listings %>%
ggplot(aes(x = log_price_4_nights)) +
geom_boxplot(fill="light blue") +
labs(title = "Distribution of Log Prices of Airbnb Apartment in Berlin for 4 nights",
subtitle = "Data includes only apartments that accommodate at least 2 people",
x="Log of Price per 4 Nights",
caption="Data from: http://insideairbnb.com/get-the-data.html") +
theme_bw() Once we log transform the price variable, there are still several outliers. We will identify these and remove them from our dataset.
logprice_outliers <- boxplot(listings$log_price_4_nights)$outlistings<-listings %>%
mutate(log_price_4_nights = if_else(log_price_4_nights %in% logprice_outliers,
NA_real_,
log_price_4_nights
)
)Next, we look at the variable property_type and use the count function to determine how many categories there are, their frequency and their proportions.
listings %>%
count(property_type, sort=TRUE) %>%
count() #There are 57 categories.| n |
|---|
| 57 |
property_counts <- listings %>%
count(property_type, sort=TRUE) %>%
mutate(sum_n=sum(n)) %>%
slice_max(order_by=n, n=4) %>%
mutate(prop=n/sum_n)
property_counts| property_type | n | sum_n | prop |
|---|---|---|---|
| Entire rental unit | 5314 | 11249 | 0.472 |
| Private room in rental unit | 4032 | 11249 | 0.358 |
| Entire condominium (condo) | 283 | 11249 | 0.0252 |
| Entire serviced apartment | 227 | 11249 | 0.0202 |
property_counts %>%
summarise(total_prop = sum(prop))| total_prop |
|---|
| 0.876 |
Since the vast majority of the observations in the data are one of the top four or five property types, we create a simplified version of property_type variable that has 5 categories: the top four categories and Other.
listings <- listings %>%
mutate(prop_type_simplified = case_when(
property_type %in% c("Entire rental unit",
"Private room in rental unit",
"Entire condominium (condo)",
"Entire serviced apartment") ~ property_type,
TRUE ~ "Other"
)
)We use the code below to check that prop_type_simplified was correctly made.
listings %>%
count(property_type, prop_type_simplified) %>%
arrange(desc(n))| property_type | prop_type_simplified | n |
|---|---|---|
| Entire rental unit | Entire rental unit | 5314 |
| Private room in rental unit | Private room in rental unit | 4032 |
| Entire condominium (condo) | Entire condominium (condo) | 283 |
| Entire serviced apartment | Entire serviced apartment | 227 |
| Entire loft | Other | 201 |
| Private room in residential home | Other | 160 |
| Room in hotel | Other | 134 |
| Private room in condominium (condo) | Other | 130 |
| Entire residential home | Other | 104 |
| Room in boutique hotel | Other | 78 |
| Private room in bed and breakfast | Other | 54 |
| Private room in loft | Other | 54 |
| Entire guesthouse | Other | 49 |
| Shared room in hostel | Other | 49 |
| Private room in townhouse | Other | 43 |
| Shared room in rental unit | Other | 43 |
| Private room in hostel | Other | 30 |
| Room in aparthotel | Other | 29 |
| Room in serviced apartment | Other | 28 |
| Entire guest suite | Other | 23 |
| Private room in serviced apartment | Other | 19 |
| Houseboat | Other | 16 |
| Entire bungalow | Other | 15 |
| Private room | Other | 14 |
| Private room in pension | Other | 11 |
| Entire townhouse | Other | 10 |
| Private room in guesthouse | Other | 9 |
| Boat | Other | 8 |
| Tiny house | Other | 8 |
| Camper/RV | Other | 7 |
| Private room in guest suite | Other | 7 |
| Private room in casa particular | Other | 5 |
| Private room in villa | Other | 5 |
| Room in hostel | Other | 5 |
| Entire cottage | Other | 4 |
| Entire place | Other | 4 |
| Private room in boat | Other | 4 |
| Shared room in boutique hotel | Other | 4 |
| Entire cabin | Other | 3 |
| Entire villa | Other | 3 |
| Shared room in condominium (condo) | Other | 3 |
| Private room in bungalow | Other | 2 |
| Private room in cottage | Other | 2 |
| Room in bed and breakfast | Other | 2 |
| Treehouse | Other | 2 |
| Castle | Other | 1 |
| Entire chalet | Other | 1 |
| Entire home/apt | Other | 1 |
| Private room in cave | Other | 1 |
| Private room in floor | Other | 1 |
| Private room in houseboat | Other | 1 |
| Private room in tiny house | Other | 1 |
| Private room in tipi | Other | 1 |
| Shared room in boat | Other | 1 |
| Shared room in loft | Other | 1 |
| Shared room in townhouse | Other | 1 |
| Shipping container | Other | 1 |
We also make a bar plot.
listings %>%
ggplot(aes(y=fct_rev(fct_infreq(prop_type_simplified)))) +
geom_bar(fill = "light blue")+
labs(
title="Bar Plot of Simplified Property Type",
y="Property Type (Simplified)",
x="Count"
)+
NULLNow, we can see that the data is much more evenly distributed.
The bathrooms data is all missing, so we will construct a bathrooms variable using bathrooms_text.
listings<-listings %>%
mutate(bathrooms_num=parse_number(bathrooms_text))We will now look at the distribution of the bathrooms_num variable.
favstats(~bathrooms_num, data=listings)| min | Q1 | median | Q3 | max | mean | sd | n | missing |
|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 1 | 1 | 8 | 1.1 | 0.311 | 11202 | 47 |
listings %>%
ggplot(aes(x = bathrooms_num)) +
geom_histogram(fill = "light blue",
colour = "grey") +
labs(title = "Distribution of Bathrooms of Airbnb Apartments in Berlin",
subtitle = "Data includes only apartments that accommodate at least 2 people",
x = "Number of bathrooms",
y = "Count",
caption="Data from: http://insideairbnb.com/get-the-data.html") +
theme_bw()listings %>%
filter(bathrooms_num==0) %>%
count()| n |
|---|
| 42 |
Surprisingly, some apartments have no bathrooms. Perhaps Airbnb classifies toilets and bathrooms separately, or these apartments have access to public toilets in the vicinity.
The data also looks right-skewed. We will make a boxplot to determine whether these extreme points are outliers.
listings %>%
ggplot(aes(x = bathrooms_num)) +
geom_boxplot(color="light blue") +
labs(title = "Boxplot of Bathrooms of Airbnb Apartments in Berlin",
subtitle = "Data includes only apartments that accommodate at least 2 people",
x = "Number of bathrooms",
caption="Data from: http://insideairbnb.com/get-the-data.html") +
theme_bw()bathrooms_outliers<-boxplot(listings$bathrooms_num)$outlength(bathrooms_outliers) #There are a large number of 'outliers'. Let's look at these more closely.[1] 1465
bathrooms_outliers %>% table() #Some of these listings have been classified as outliers despite having a reasonable number of bathrooms. However it is unlikely that 2 .
0 1.5 2 2.5 3 3.5 4 4.5 8
42 739 578 77 18 3 4 3 1
#We will only discard apartments with 11 or more beds.
listings<-listings %>%
mutate(beds = if_else(beds>=11,NA_real_,beds)
)Some of these listings have been classified as outliers despite having a reasonable number of bathrooms. However, 8 bathrooms is questionable. It is unlikely that 2 people would require an 8 bathroom apartment for a 4-night stay, so for our analysis we will remove this entry.
listings<-listings %>%
mutate(bathrooms_num = if_else(bathrooms_num>=8,NA_real_,bathrooms_num)
)We will now look at the distribution of the beds variable.
favstats(~beds, data=listings)| min | Q1 | median | Q3 | max | mean | sd | n | missing |
|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 1 | 2 | 10 | 1.63 | 1.13 | 11119 | 130 |
listings %>%
ggplot(aes(x = beds)) +
geom_histogram(fill = "light blue") +
labs(title = "Distribution of Beds of Airbnb Apartments in Berlin",
subtitle = "Data includes only apartments that accommodate at least 2 people",
x = "Number of beds",
y = "Count",
caption="Data from: http://insideairbnb.com/get-the-data.html") +
theme_bw() The data is right-skewed. There is even an apartment in the dataset which claim to have 10 beds. While this is possible, they may also be outliers. We will look at the boxplot.
listings %>%
ggplot(aes(x = beds)) +
geom_boxplot(fill="light blue") +
labs(title = "Boxplot of Beds of Airbnb Apartments in Berlin",
subtitle = "Data includes only apartments that accommodate at least 2 people",
x = "Number of beds",
caption="Data from: http://insideairbnb.com/get-the-data.html") +
theme_bw() The boxplot confirms that the data is right-skewed. Next we check for and remove some outliers.
beds_outliers<-boxplot(listings$beds)$outlength(beds_outliers) #There are a large number of 'outliers'. Let's look at these more closely.[1] 768
beds_outliers %>% table() #There are still a large number of apartments with 4-8 beds, so we will not discard these..
4 5 6 7 8 9 10
429 153 130 26 28 1 1
#We will only discard apartments with 9 or more beds.
listings<-listings %>%
mutate(beds = if_else(beds>=9,NA_real_,beds)
)We now perform the same analysis on the bedrooms variable.
favstats(~bedrooms, data=listings)| min | Q1 | median | Q3 | max | mean | sd | n | missing |
|---|---|---|---|---|---|---|---|---|
| 1 | 1 | 1 | 1 | 10 | 1.24 | 0.554 | 10346 | 903 |
listings %>%
ggplot(aes(x = bedrooms)) +
geom_histogram(fill = "light blue") +
labs(title = "Distribution of Bedrooms of Airbnb Apartments in Berlin",
subtitle = "Data includes only apartments that accommodate at least 2 people",
x = "Number of bedrooms",
y = "Count",
caption="Data from: http://insideairbnb.com/get-the-data.html") +
theme_bw()Once again, the data is heavily right-skewed.
listings %>%
ggplot(aes(x = bedrooms)) +
geom_boxplot(fill = "light blue") +
labs(title = "Boxplot of Bedrooms of Airbnb Apartments in Berlin",
subtitle = "Data includes only apartments that accommodate at least 2 people",
x = "Number of bedrooms",
caption="Data from: http://insideairbnb.com/get-the-data.html") +
theme_bw() This skewness is confirmed in the boxplot.
bedrooms_outliers<-boxplot(listings$bedrooms)$outlength(bedrooms_outliers) #There are a large number of 'outliers'. Let's look at these more closely.[1] 2019
bedrooms_outliers %>% table() #Even apartments with two bedrooms have been categorised as outliers, despite there being a large number in the dataset..
2 3 4 5 6 10
1601 351 57 4 5 1
#We will only discard apartments with 10 or more bedrooms.
listings<-listings %>%
mutate(bedrooms = if_else(bedrooms>=10,NA_real_,bedrooms)
)Next, we will look at the variable number_of_reviews in more detail:
listings %>%
ggplot(aes(x = number_of_reviews)) +
geom_density(fill = "light blue") +
labs(title = "Distribution of Reviews of Airbnb Apartments in Berlin",
subtitle = "Data includes only apartments that accommodate at least 2 people",
x = "Number of reviews",
caption="Data from: http://insideairbnb.com/get-the-data.html") +
theme_bw() The distribution is heavily right-skewed.
Next we will look at the outliers.
listings %>%
ggplot(aes(x = number_of_reviews)) +
geom_boxplot(fill = "light blue") +
labs(title = "Boxplot of Reviews of Airbnb Apartments in Berlin",
subtitle = "Data includes only apartments that accommodate at least 2 people",
x = "Number of reviews",
caption="Data from: http://insideairbnb.com/get-the-data.html") +
theme_bw()numreviews_outliers<-boxplot(listings$number_of_reviews)$outlength(numreviews_outliers)[1] 1567
#The boxplot indicates that there are a huge number of outliers.The boxplot indicates that there are a huge number of technical outliers. We will not remove these. Perhaps some apartments have more reviews than others because they are in extremely popular locations, or perhaps they are larger apartments and each guest filled out a review.
Similarly, we will also consider the number of reviews per month.
listings %>%
ggplot(aes(x = reviews_per_month)) +
geom_density(fill = "light blue") +
labs(title = "Distribution of Reviews per month of Airbnb Apartments in Berlin",
subtitle = "Data includes only apartments that accommodate at least 2 people",
x = "Reviews per month",
y = "Density",
caption="Data from: http://insideairbnb.com/get-the-data.html") +
theme_bw() As expected, this follows a similar shape to
number_of_reviews.
listings %>%
ggplot(aes(x = reviews_per_month)) +
geom_boxplot(fill = "light blue") +
labs(title = "Boxplot of Reviews of Airbnb Apartments in Berlin",
subtitle = "Data includes only apartments that accommodate at least 2 people",
x = "Reviews per month",
caption="Data from: http://insideairbnb.com/get-the-data.html") +
theme_bw()numreviews_outliers<-boxplot(listings$reviews_per_month)$outlength(numreviews_outliers)[1] 767
#The boxplot indicates that there are a huge number of outliers.Once again, there are a huge number of technical outliers, which we will not remove.
There are several measures of review scores in the dataset. To compare these, we start by pivoting the data into long format.
reviews_long<-listings %>%
select(contains("review_score")) %>%
pivot_longer(
cols=1:ncol(.),
names_to="review_type",
values_to="review_scores")Next we look at summary statistics of the variables
favstats(review_scores~review_type, data=reviews_long)| review_type | min | Q1 | median | Q3 | max | mean | sd | n | missing |
|---|---|---|---|---|---|---|---|---|---|
| review_scores_accuracy | 0 | 4.75 | 4.91 | 5 | 5 | 4.79 | 0.39 | 9418 | 1831 |
| review_scores_checkin | 0 | 4.8 | 4.95 | 5 | 5 | 4.83 | 0.373 | 9414 | 1835 |
| review_scores_cleanliness | 0 | 4.5 | 4.8 | 5 | 5 | 4.64 | 0.503 | 9421 | 1828 |
| review_scores_communication | 0 | 4.8 | 4.96 | 5 | 5 | 4.83 | 0.372 | 9420 | 1829 |
| review_scores_location | 0 | 4.67 | 4.86 | 5 | 5 | 4.76 | 0.364 | 9414 | 1835 |
| review_scores_rating | 0 | 4.61 | 4.83 | 5 | 5 | 4.65 | 0.714 | 9576 | 1673 |
| review_scores_value | 0 | 4.55 | 4.75 | 5 | 5 | 4.67 | 0.426 | 9412 | 1837 |
review_scores_rating and reviews_scores_cleanliness have higher standard deviations.
Next, we plot these variables and compare distributions.
reviews_long %>%
ggplot(aes(x = review_scores, fill=review_type)) +
geom_density() +
facet_wrap(~review_type, ncol=3)+
labs(title = "Distribution of Review Scores of Airbnb Apartments in Berlin",
subtitle = "Data includes only apartments that accommodate at least 2 people",
x = "Review Scores",
y = "Density",
caption="Data from: http://insideairbnb.com/get-the-data.html") +
theme_bw() +
theme(legend.position="none") At first glance, the variables look fairly similar, with most reviews falling in the range 4-5.
listings %>%
select(contains("review_score")) %>%
ggpairs()The correlation plot further confirms that there is some degree of correlation between the variables.
Next, we want to explore the 3 variables related to neighborhoods neighbourhood, neighbourhood_cleansed, and neighbourhood_group_cleansed to understand which variable describes better the actual neighbourhoods in Berlin.
listings %>%
select("neighbourhood","neighbourhood_cleansed","neighbourhood_group_cleansed") %>%
head()| neighbourhood | neighbourhood_cleansed | neighbourhood_group_cleansed |
|---|---|---|
| Berlin, Germany | Helmholtzplatz | Pankow |
| Berlin, Germany | Prenzlauer Berg Südwest | Pankow |
| Berlin, Germany | Brunnenstr. Süd | Mitte |
| Berlin, Germany | Helmholtzplatz | Pankow |
| Moabit Ost | Mitte | |
| Berlin, Germany | Brunnenstr. Süd | Mitte |
Next, we want to see how many different neighborhoods we have in the dataset and how frequently they appear. We will be looking at the neighbourhood_group_cleansed variable.
listings %>%
count(neighbourhood_group_cleansed, sort=TRUE) %>%
count() #There are 12 categories.| n |
|---|
| 12 |
neighbourhood_counts <- listings %>%
count(neighbourhood_group_cleansed, sort=TRUE) %>%
mutate(sum_n=sum(n)) %>%
slice_max(order_by=n, n=6) %>%
mutate(prop=n/sum_n)
neighbourhood_counts| neighbourhood_group_cleansed | n | sum_n | prop |
|---|---|---|---|
| Friedrichshain-Kreuzberg | 2611 | 11249 | 0.232 |
| Mitte | 2427 | 11249 | 0.216 |
| Pankow | 1804 | 11249 | 0.16 |
| Neukölln | 1549 | 11249 | 0.138 |
| Charlottenburg-Wilm. | 860 | 11249 | 0.0765 |
| Tempelhof - Schöneberg | 698 | 11249 | 0.062 |
neighbourhood_counts %>%
summarise(total_prop = sum(prop))| total_prop |
|---|
| 0.884 |
We have a total of 12 neighbourhoods, however 6 of them represent 88.4% of all the observations. From our knowledge of Berlin, these 6 neighbourhoods are indeed the most popular ones while the others are more on the outskirts of the city, therefore, we will group the other 6 ones into the “Other” category.
listings <- listings %>%
mutate(neighbourhood_simplified = case_when(
neighbourhood_group_cleansed %in% c("Friedrichshain-Kreuzberg",
"Mitte",
"Pankow",
"Neukölln",
"Charlottenburg-Wilm.",
"Tempelhof - Schöneberg") ~ neighbourhood_group_cleansed,
TRUE ~ "Other"
))We can check our output below.
listings %>%
count(neighbourhood_group_cleansed, neighbourhood_simplified) %>%
arrange(desc(n))| neighbourhood_group_cleansed | neighbourhood_simplified | n |
|---|---|---|
| Friedrichshain-Kreuzberg | Friedrichshain-Kreuzberg | 2611 |
| Mitte | Mitte | 2427 |
| Pankow | Pankow | 1804 |
| Neukölln | Neukölln | 1549 |
| Charlottenburg-Wilm. | Charlottenburg-Wilm. | 860 |
| Tempelhof - Schöneberg | Tempelhof - Schöneberg | 698 |
| Treptow - Köpenick | Other | 387 |
| Lichtenberg | Other | 379 |
| Steglitz - Zehlendorf | Other | 217 |
| Reinickendorf | Other | 155 |
| Marzahn - Hellersdorf | Other | 86 |
| Spandau | Other | 76 |
listings %>%
ggplot(aes(y = fct_rev(fct_infreq(neighbourhood_simplified)))) +
geom_bar(fill = "light blue") +
labs(title = "Distribution of Airbnb Apartments in Berlinof Neighbourhoods ",
subtitle = "Data includes only apartments that accommodate at least 2 people",
y = "Neighbourhood",
x = "Count",
caption="Data from: http://insideairbnb.com/get-the-data.html") +
theme_bw() ### Superhost
In our analysis, we will also consider whether being a host_is_superhost is an important factor in determining the price of a 4-night stay. Let’s look at the proportions of listings which fall into each category.
listings %>%
count(host_is_superhost) %>%
mutate(total=sum(n),
prop=n/total)| host_is_superhost | n | total | prop |
|---|---|---|---|
| FALSE | 9280 | 11249 | 0.825 |
| TRUE | 1957 | 11249 | 0.174 |
| 12 | 11249 | 0.00107 |
Finally, we consider the availability_30 variable. We expect this variable to range from 0-30, corresponding to the number of the days the apartment is available in the next 30 days. This variable may be a good predictor of price as it indicates the relative demand for an apartment - rooms which have low availability over the next 30 days are likely to have higher demand and can therefore charge higher prices.
favstats(~availability_30, data=listings)| min | Q1 | median | Q3 | max | mean | sd | n | missing |
|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 4 | 30 | 4.02 | 7.72 | 11249 | 0 |
We also consider the distribution of availability.
listings %>%
ggplot(aes(x = availability_30)) +
geom_density(fill="light blue") +
labs(title = "Distribution of 30-Day Availability of Airbnb Apartments in Berlin",
subtitle = "Data includes only apartments that accommodate at least 2 people",
x = "Availability in next 30 days",
y="Density",
caption="Data from: http://insideairbnb.com/get-the-data.html") +
theme_bw() The distribution has a strong positive skew.
Next, we will look at the correlations of continuous variables. As we want to evaluate multiple variables, first we will look at those not related to the reviews.
listings %>%
ggpairs(columns = c(
"log_price_4_nights",
"bedrooms",
"beds",
"accommodates",
"host_response_rate",
"host_listings_count",
"availability_30",
"bathrooms_num",
"calculated_host_listings_count"),
upper = list(continuous = "cor"),
lower = list(continuous = "points"))We observe that “log_price_4_nights” has the highest positive correlation with the variables: “bedrooms” (0.428), “beds” (0.374), “accommodate” (0.506), “availability_30” (0.354), “bathrooms_num” (0.137), and “calculated_host_listings_count” (0.207).
In the next step, we will evaluate correlations between the “log_price_4_nights” and reviews-related numeric variables:
listings %>%
ggpairs(columns = c(
"log_price_4_nights",
"number_of_reviews",
"review_scores_rating",
"number_of_reviews_l30d",
"reviews_per_month",
"review_scores_value",
"review_scores_cleanliness",
"review_scores_location",
"review_scores_checkin",
"review_scores_accuracy"),
upper = list(continuous = "cor"),
lower = list(continuous = "points"))We observe that
log_price_4_nightshas the highest correlation with the variables:number_of_reviews(0.094),review_scores_location(0.078),number_of_reviews_l30d(0.089),reviews_per_month(0.141),review_scores_value(-0.100). However, compared to the numerical variables evaluated above,reviews-related variables seem to have lower correlations with the price.
Next, we will separately evaluate correlations of categorical variables.
listings %>%
ggpairs(columns = c("log_price_4_nights",
"prop_type_simplified",
"room_type",
"neighbourhood_simplified"
))Looking at the boxplots, we notice that that
property_type_simplifiedandroom_typeseem to have the highest correlation with the “log_price_4_nights”.
Visualizations of feature distributions and their relations are key to understanding a data set, and they can open up new lines of exploration. We will use the following code to show a map of your city, and overlay all Airbnb coordinates to get an overview of the spatial distribution of properties. For this visualization, we use the leaflet package, which includes a variety of tools for interactive maps and allows for easy zoom in-out, clicking on a point to get the actual listing, etc.
leaflet(data = listings) %>%
addProviderTiles("OpenStreetMap.Mapnik") %>%
addCircleMarkers(lng = ~longitude,
lat = ~latitude,
radius = 1,
fillColor = "light blue",
fillOpacity = 0.4,
popup = ~listing_url,
label = ~property_type)For the target variable \(Y\), we will use the cost for two people to stay at an Airbnb location for four (4) nights.
We have already created a new variable called price_4_nights that uses price, and accomodates to calculate the total cost for two people to stay at the Airbnb property for 4 nights. This is the variable \(Y\) we want to explain.
First, we need to divide our data into a test and training subset. This will allows us to test the predictive power of our model using an out-of-sample test.
#splitting data into training and test sets
set.seed(1234) #setting a random seed to always get the same split
train_test_split <- initial_split(listings, prop = 0.80) #taking an 80-20 approach (for all the consultants out there)
listings_train <- training(train_test_split)
listings_test <- testing(train_test_split)Now, our test dataset has 2250 observations and training data has 8999 observations with 79 variables each.
We will set our test data aside and do our analysis using only the training data
# Model 1 - Property Type, Number of Reviews, Review Scores
model1 <- lm(log_price_4_nights ~ prop_type_simplified +
number_of_reviews +
review_scores_rating,
data = listings_train)
# Since prop_type_simplified is a categorical variable with 5 possible options, the model only shows 4 coefficients. The fifth one (Entire Condo), is incorporated in the intercept
# We construct a table with the regression output
huxtable:: huxreg(list("Model1" = model1),
statistics = c('#observations' = 'nobs',
'R squared' = 'r.squared',
'Adj. R Squared' = 'adj.r.squared',
'Residual SE' = 'sigma',
'P-Value' = 'p.value'))| Model1 | |
|---|---|
| (Intercept) | 5.677 *** |
| (0.049) | |
| prop_type_simplifiedEntire rental unit | -0.132 *** |
| (0.034) | |
| prop_type_simplifiedEntire serviced apartment | 0.461 *** |
| (0.053) | |
| prop_type_simplifiedOther | -0.222 *** |
| (0.037) | |
| prop_type_simplifiedPrivate room in rental unit | -0.844 *** |
| (0.035) | |
| number_of_reviews | 0.001 *** |
| (0.000) | |
| review_scores_rating | 0.036 *** |
| (0.008) | |
| #observations | 7562 |
| R squared | 0.369 |
| Adj. R Squared | 0.368 |
| Residual SE | 0.473 |
| P-Value | 0.000 |
| *** p < 0.001; ** p < 0.01; * p < 0.05. | |
model1 %>%
broom::tidy(conf.int=TRUE)| term | estimate | std.error | statistic | p.value | conf.low | conf.high |
|---|---|---|---|---|---|---|
| (Intercept) | 5.68 | 0.0489 | 116 | 0 | 5.58 | 5.77 |
| prop_type_simplifiedEntire rental unit | -0.132 | 0.0342 | -3.85 | 0.000117 | -0.199 | -0.0648 |
| prop_type_simplifiedEntire serviced apartment | 0.461 | 0.0531 | 8.68 | 4.75e-18 | 0.357 | 0.565 |
| prop_type_simplifiedOther | -0.222 | 0.037 | -6 | 2.06e-09 | -0.294 | -0.149 |
| prop_type_simplifiedPrivate room in rental unit | -0.844 | 0.0345 | -24.5 | 3.33e-127 | -0.912 | -0.777 |
| number_of_reviews | 0.000576 | 9.2e-05 | 6.26 | 4e-10 | 0.000396 | 0.000757 |
| review_scores_rating | 0.0362 | 0.00758 | 4.78 | 1.8e-06 | 0.0214 | 0.0511 |
model1 %>%
broom::glance()| r.squared | adj.r.squared | sigma | statistic | p.value | df | logLik | AIC | BIC | deviance | df.residual | nobs |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0.369 | 0.368 | 0.473 | 736 | 0 | 6 | -5.07e+03 | 1.02e+04 | 1.02e+04 | 1.69e+03 | 7555 | 7562 |
Since the variable property_type is categorical with 5 possible options, the model only shows 4 coefficients. The fifth one (Entire Condo), is incorporated in the intercept. This means the coefficients of the other four are in relation to the fifth one. To interpret the coefficients for a log-transformed dependent variable, we are going to take the exponent of the coefficient. In this case, entire rental units are on average 12.4% cheaper than entire condos for 4 nights. Similarly, entire serviced apartments are 58.6% more expensive than entire condos, whereas private rooms in rental units are 57.0% cheaper than entire condos. Finally, other types of properties are on average 19.9% cheaper than entire condos.
Moreover, we see that one unit change in review_scores_rating corresponds to 3.7% higher price per 4 nights.
Finally, every additional review corresponds to 0.06% higher price per 4 nights. This is quite small but we also have to consider that the unit change in number_of_reviews is a single review. The maximum number of reviews on a property is 655 and we can see that if the unit change was 100, for every 100 additional reviews, the price per 4 nights would increase by 5.8%, which looks much more substantial.
# Model 2 - Model 1 + Room Type
model2 <- lm(log_price_4_nights ~ prop_type_simplified +
number_of_reviews +
review_scores_rating +
room_type,
data = listings_train)
# Since room_type is a categorical variable with 4 possible options, the model only shows 3 coefficients. The fourth option (Entire Home/ Apartment) is incorporated in the intercept
# We construct a table with the regression output
huxtable:: huxreg(list("Model2" = model2),
statistics = c('#observations' = 'nobs',
'R squared' = 'r.squared',
'Adj. R Squared' = 'adj.r.squared',
'Residual SE' = 'sigma',
'P-Value' = 'p.value'))| Model2 | |
|---|---|
| (Intercept) | 5.713 *** |
| (0.047) | |
| prop_type_simplifiedEntire rental unit | -0.133 *** |
| (0.033) | |
| prop_type_simplifiedEntire serviced apartment | 0.459 *** |
| (0.051) | |
| prop_type_simplifiedOther | 0.074 |
| (0.041) | |
| prop_type_simplifiedPrivate room in rental unit | -0.310 *** |
| (0.048) | |
| number_of_reviews | 0.001 *** |
| (0.000) | |
| review_scores_rating | 0.029 *** |
| (0.007) | |
| room_typeHotel room | 0.440 *** |
| (0.061) | |
| room_typePrivate room | -0.535 *** |
| (0.034) | |
| room_typeShared room | -0.906 *** |
| (0.060) | |
| #observations | 7562 |
| R squared | 0.412 |
| Adj. R Squared | 0.411 |
| Residual SE | 0.457 |
| P-Value | 0.000 |
| *** p < 0.001; ** p < 0.01; * p < 0.05. | |
model2 %>%
broom::tidy(conf.int=TRUE)| term | estimate | std.error | statistic | p.value | conf.low | conf.high |
|---|---|---|---|---|---|---|
| (Intercept) | 5.71 | 0.0473 | 121 | 0 | 5.62 | 5.81 |
| prop_type_simplifiedEntire rental unit | -0.133 | 0.033 | -4.01 | 6.07e-05 | -0.197 | -0.0678 |
| prop_type_simplifiedEntire serviced apartment | 0.459 | 0.0513 | 8.95 | 4.27e-19 | 0.359 | 0.56 |
| prop_type_simplifiedOther | 0.0737 | 0.0412 | 1.79 | 0.0737 | -0.00707 | 0.155 |
| prop_type_simplifiedPrivate room in rental unit | -0.31 | 0.0477 | -6.5 | 8.44e-11 | -0.404 | -0.217 |
| number_of_reviews | 0.000575 | 8.9e-05 | 6.46 | 1.13e-10 | 0.0004 | 0.000749 |
| review_scores_rating | 0.0287 | 0.00733 | 3.92 | 8.96e-05 | 0.0144 | 0.0431 |
| room_typeHotel room | 0.44 | 0.0612 | 7.2 | 6.83e-13 | 0.32 | 0.56 |
| room_typePrivate room | -0.535 | 0.0342 | -15.6 | 2.58e-54 | -0.602 | -0.468 |
| room_typeShared room | -0.906 | 0.0598 | -15.2 | 3.43e-51 | -1.02 | -0.789 |
model2 %>%
broom::glance()| r.squared | adj.r.squared | sigma | statistic | p.value | df | logLik | AIC | BIC | deviance | df.residual | nobs |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0.412 | 0.411 | 0.457 | 587 | 0 | 9 | -4.8e+03 | 9.63e+03 | 9.7e+03 | 1.58e+03 | 7552 | 7562 |
log_price_4_nights, which is higher than Model 1. All variables are statistically significant aside from prop_type_simplifiedOther. To compare the two models, however, we need to look at the adjusted r-squared, since Model 2 includes one additional variable. Model 2 has an adjusted r-squared of 41.11% compared to 36.83% for Model 1. We can conclude that Model 2 has a higher predictive power than Model 1.
Since room_type is a categorical variable with 4 possible options, the model only shows 3 coefficients. The fourth option (Entire Home/ Apartment) is incorporated in the intercept. To interpret the coefficients for a log-transformed dependent variable, we are going to take the exponent of the coefficient. This means that hotel rooms, for example, are on average 55.27% more expensive than entire home/apartments. Similarly, private rooms 41.4% cheaper, whereas shared rooms are 59.6% cheaper than entire homes.
Now, before we continue with additional model specifications, we are going to perform a VIF test to check for multicollinearity:
autoplot(model2)car::vif(model2) GVIF Df GVIF^(1/(2*Df))
prop_type_simplified 11.942272 4 1.363439
number_of_reviews 1.018235 1 1.009076
review_scores_rating 1.009018 1 1.004499
room_type 11.923753 3 1.511479
We can see that room_type and property_type_simplified have VIF scores of more than 10, which means there is potential multicollinearity problem in our model. Before we proceed with the analysis, we will test which variable has a higher predictive power and remove the other:
model2_property <- lm(log_price_4_nights ~ prop_type_simplified +
number_of_reviews +
review_scores_rating,
data = listings_train)
model2_room <- lm(log_price_4_nights ~ number_of_reviews +
review_scores_rating +
room_type,
data = listings_train)
huxtable:: huxreg(list("Property Type Model" = model2_property,
"Room Type Model" = model2_room),
statistics = c('#observations' = 'nobs',
'R squared' = 'r.squared',
'Adj. R Squared' = 'adj.r.squared',
'Residual SE' = 'sigma')
) %>%
set_caption('Comparison of Property Type vs. Room Type Models')| Property Type Model | Room Type Model | |
|---|---|---|
| (Intercept) | 5.677 *** | 5.638 *** |
| (0.049) | (0.036) | |
| prop_type_simplifiedEntire rental unit | -0.132 *** | |
| (0.034) | ||
| prop_type_simplifiedEntire serviced apartment | 0.461 *** | |
| (0.053) | ||
| prop_type_simplifiedOther | -0.222 *** | |
| (0.037) | ||
| prop_type_simplifiedPrivate room in rental unit | -0.844 *** | |
| (0.035) | ||
| number_of_reviews | 0.001 *** | 0.001 *** |
| (0.000) | (0.000) | |
| review_scores_rating | 0.036 *** | 0.024 ** |
| (0.008) | (0.008) | |
| room_typeHotel room | 0.608 *** | |
| (0.058) | ||
| room_typePrivate room | -0.701 *** | |
| (0.011) | ||
| room_typeShared room | -0.742 *** | |
| (0.056) | ||
| #observations | 7562 | 7562 |
| R squared | 0.369 | 0.371 |
| Adj. R Squared | 0.368 | 0.370 |
| Residual SE | 0.473 | 0.472 |
| *** p < 0.001; ** p < 0.01; * p < 0.05. | ||
model2 %>%
broom::tidy(conf.int=TRUE)| term | estimate | std.error | statistic | p.value | conf.low | conf.high |
|---|---|---|---|---|---|---|
| (Intercept) | 5.71 | 0.0473 | 121 | 0 | 5.62 | 5.81 |
| prop_type_simplifiedEntire rental unit | -0.133 | 0.033 | -4.01 | 6.07e-05 | -0.197 | -0.0678 |
| prop_type_simplifiedEntire serviced apartment | 0.459 | 0.0513 | 8.95 | 4.27e-19 | 0.359 | 0.56 |
| prop_type_simplifiedOther | 0.0737 | 0.0412 | 1.79 | 0.0737 | -0.00707 | 0.155 |
| prop_type_simplifiedPrivate room in rental unit | -0.31 | 0.0477 | -6.5 | 8.44e-11 | -0.404 | -0.217 |
| number_of_reviews | 0.000575 | 8.9e-05 | 6.46 | 1.13e-10 | 0.0004 | 0.000749 |
| review_scores_rating | 0.0287 | 0.00733 | 3.92 | 8.96e-05 | 0.0144 | 0.0431 |
| room_typeHotel room | 0.44 | 0.0612 | 7.2 | 6.83e-13 | 0.32 | 0.56 |
| room_typePrivate room | -0.535 | 0.0342 | -15.6 | 2.58e-54 | -0.602 | -0.468 |
| room_typeShared room | -0.906 | 0.0598 | -15.2 | 3.43e-51 | -1.02 | -0.789 |
model2 %>%
broom::glance()| r.squared | adj.r.squared | sigma | statistic | p.value | df | logLik | AIC | BIC | deviance | df.residual | nobs |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0.412 | 0.411 | 0.457 | 587 | 0 | 9 | -4.8e+03 | 9.63e+03 | 9.7e+03 | 1.58e+03 | 7552 | 7562 |
As we can see from the table, the model that includes room types instead of property types has both a higher r-squared and adj. r-squared. To avoid multicollinearity, we will continue using that model and will discard property type.
car::vif(model2_room) GVIF Df GVIF^(1/(2*Df))
number_of_reviews 1.014033 1 1.006992
review_scores_rating 1.007116 1 1.003552
room_type 1.010184 3 1.001690
Our dataset has many more variables, so we attempt to extend our analysis by adding a number of variables, which could prove useful in increasing the explanatory power of our model.
bathrooms, bedrooms, beds, or size of the house (accomodates) significant predictors of price_4_nights? Or might these be co-linear variables?(host_is_superhost) command a pricing premium, after controlling for other variables?instant_bookable == TRUE), while a non-trivial proportion don’t. After controlling for other variables, is instant_bookable a significant predictor of price_4_nights?neighbourhood, neighbourhood_cleansed, and neighbourhood_group_cleansed. There are typically more than 20 neighbourhoods in each city, and it wouldn’t make sense to include them all in your model. Use your city knowledge, or ask someone with city knowledge, and see whether you can group neighbourhoods together so the majority of listings falls in fewer (5-6 max) geographical areas. You would thus need to create a new categorical variabale neighbourhood_simplified and determine whether location is a predictor of price_4_nightsavalability_30 or reviews_per_month on price_4_nights, after we control for other variables?# Model 3 - Adding property-specific characteristics to Model_Rooms.
model3 <- lm(log_price_4_nights ~ number_of_reviews +
review_scores_rating +
room_type +
bedrooms +
bathrooms_num+
beds +
accommodates,
data = listings_train)
# We construct a table with the regression output
huxtable:: huxreg(list("Model3" = model3),
statistics = c('#observations' = 'nobs',
'R squared' = 'r.squared',
'Adj. R Squared' = 'adj.r.squared',
'Residual SE' = 'sigma',
'P-Value' = 'p.value')
)| Model3 | |
|---|---|
| (Intercept) | 4.914 *** |
| (0.041) | |
| number_of_reviews | 0.000 ** |
| (0.000) | |
| review_scores_rating | 0.029 *** |
| (0.007) | |
| room_typeHotel room | 0.717 *** |
| (0.053) | |
| room_typePrivate room | -0.544 *** |
| (0.012) | |
| room_typeShared room | -0.891 *** |
| (0.053) | |
| bedrooms | 0.076 *** |
| (0.013) | |
| bathrooms_num | 0.151 *** |
| (0.018) | |
| beds | 0.016 * |
| (0.007) | |
| accommodates | 0.127 *** |
| (0.007) | |
| #observations | 6903 |
| R squared | 0.509 |
| Adj. R Squared | 0.509 |
| Residual SE | 0.422 |
| P-Value | 0.000 |
| *** p < 0.001; ** p < 0.01; * p < 0.05. | |
model3 %>%
broom::tidy(conf.int=TRUE)| term | estimate | std.error | statistic | p.value | conf.low | conf.high |
|---|---|---|---|---|---|---|
| (Intercept) | 4.91 | 0.0408 | 120 | 0 | 4.83 | 4.99 |
| number_of_reviews | 0.000255 | 8.73e-05 | 2.92 | 0.00351 | 8.37e-05 | 0.000426 |
| review_scores_rating | 0.0287 | 0.00729 | 3.93 | 8.49e-05 | 0.0144 | 0.043 |
| room_typeHotel room | 0.717 | 0.0533 | 13.4 | 1.14e-40 | 0.612 | 0.821 |
| room_typePrivate room | -0.544 | 0.0116 | -47.1 | 0 | -0.567 | -0.522 |
| room_typeShared room | -0.891 | 0.0526 | -16.9 | 4.76e-63 | -0.994 | -0.787 |
| bedrooms | 0.0763 | 0.013 | 5.85 | 5.02e-09 | 0.0507 | 0.102 |
| bathrooms_num | 0.151 | 0.0184 | 8.18 | 3.36e-16 | 0.114 | 0.187 |
| beds | 0.0163 | 0.00697 | 2.34 | 0.0191 | 0.00267 | 0.03 |
| accommodates | 0.127 | 0.00676 | 18.8 | 6.4e-77 | 0.114 | 0.14 |
model3 %>%
broom::glance()| r.squared | adj.r.squared | sigma | statistic | p.value | df | logLik | AIC | BIC | deviance | df.residual | nobs |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0.509 | 0.509 | 0.422 | 795 | 0 | 9 | -3.84e+03 | 7.7e+03 | 7.77e+03 | 1.23e+03 | 6893 | 6903 |
We see from the table that Model 3 is statistically significant and adding information about property amenities like number of bedrooms and how many people a property can accommodate greatly increases the predictive power of the model. We see an r-squared of 50.9% and an adjusted r-squared of 50.9%. Looking at the statistical significance of the explanatory variables, we find that they are all statistically significant at least a 5% level.
However, we need be weary of potential multicollinearity. Logically, it could expected that bedrooms, beds, and accommodates would be highly correlated. Therefore, we test for multicollinearity by running a VIF-test.
# We test for multicollinearity
car::vif(model3) GVIF Df GVIF^(1/(2*Df))
number_of_reviews 1.032847 1 1.016291
review_scores_rating 1.008821 1 1.004401
room_type 1.365954 3 1.053350
bedrooms 1.926766 1 1.388080
bathrooms_num 1.116281 1 1.056542
beds 2.577324 1 1.605405
accommodates 3.195903 1 1.787709
Although the results don’t suggest there is multicollinearity problem with Model 3, we will attempt to remove accommodates to see the effect:
# Model 4 A - Model 3 without `accommodates`
model4_a <- lm(log_price_4_nights ~ number_of_reviews +
review_scores_rating +
room_type +
bedrooms +
bathrooms_num+
beds,
data = listings_train)
# We construct a table with the regression output
huxtable:: huxreg(list("No-Accommodates Model" = model4_a),
statistics = c('#observations' = 'nobs',
'R squared' = 'r.squared',
'Adj. R Squared' = 'adj.r.squared',
'Residual SE' = 'sigma',
'P-Value' = 'p.value')
)| No-Accommodates Model | |
|---|---|
| (Intercept) | 5.057 *** |
| (0.041) | |
| number_of_reviews | 0.000 *** |
| (0.000) | |
| review_scores_rating | 0.027 *** |
| (0.007) | |
| room_typeHotel room | 0.686 *** |
| (0.055) | |
| room_typePrivate room | -0.602 *** |
| (0.011) | |
| room_typeShared room | -0.891 *** |
| (0.054) | |
| bedrooms | 0.167 *** |
| (0.012) | |
| bathrooms_num | 0.163 *** |
| (0.019) | |
| beds | 0.096 *** |
| (0.006) | |
| #observations | 6903 |
| R squared | 0.484 |
| Adj. R Squared | 0.484 |
| Residual SE | 0.433 |
| P-Value | 0.000 |
| *** p < 0.001; ** p < 0.01; * p < 0.05. | |
car::vif(model4_a) GVIF Df GVIF^(1/(2*Df))
number_of_reviews 1.028887 1 1.014340
review_scores_rating 1.008604 1 1.004293
room_type 1.269852 3 1.040620
bedrooms 1.662095 1 1.289223
bathrooms_num 1.114865 1 1.055872
beds 1.631100 1 1.277145
The model and all the descriptive statistics prove statistically significant. The predictive power of our model has decreased, however, from 50.9% adjusted r-squared to 48.4%. We will now try removing beds and running the regression with accommodates instead:
# Model 4 B - Model 3 without `beds`
model4_b <- lm(log_price_4_nights ~ number_of_reviews +
review_scores_rating +
room_type +
bedrooms +
bathrooms_num+
accommodates,
data = listings_train)
# We construct a table with the regression output
huxtable:: huxreg(list("No Beds Model" = model4_b),
statistics = c('#observations' = 'nobs',
'R squared' = 'r.squared',
'Adj. R Squared' = 'adj.r.squared',
'Residual SE' = 'sigma',
'P-Value' = 'p.value')
)| No Beds Model | |
|---|---|
| (Intercept) | 4.907 *** |
| (0.041) | |
| number_of_reviews | 0.000 ** |
| (0.000) | |
| review_scores_rating | 0.028 *** |
| (0.007) | |
| room_typeHotel room | 0.719 *** |
| (0.053) | |
| room_typePrivate room | -0.543 *** |
| (0.012) | |
| room_typeShared room | -0.853 *** |
| (0.051) | |
| bedrooms | 0.081 *** |
| (0.013) | |
| bathrooms_num | 0.154 *** |
| (0.018) | |
| accommodates | 0.136 *** |
| (0.005) | |
| #observations | 6947 |
| R squared | 0.510 |
| Adj. R Squared | 0.509 |
| Residual SE | 0.422 |
| P-Value | 0.000 |
| *** p < 0.001; ** p < 0.01; * p < 0.05. | |
car::vif(model4_b) GVIF Df GVIF^(1/(2*Df))
number_of_reviews 1.030454 1 1.015113
review_scores_rating 1.008318 1 1.004150
room_type 1.307427 3 1.045690
bedrooms 1.892493 1 1.375679
bathrooms_num 1.117572 1 1.057153
accommodates 2.029664 1 1.424663
The predictive power of Model 4_b is higher than Model 4_a and all variables are statistically significant at the 5% level. Comparing with model 3 adjusted R-squared, we see that we arrive at a similar explanatory power of 50.9 % with a reduced number of explanatory variables. Furthermore, we see that the VIF-analysis returns a lower level of multicollinearity. Thus, we move forward with model 4_b, renaming it to model4.
# Model 5 - Model 4 + host_is_superhost
# Renaming Model 4 B to Model 4
model4 <- model4_b
model5 <- lm(log_price_4_nights ~ number_of_reviews +
review_scores_rating +
room_type +
bedrooms +
bathrooms_num+
accommodates +
host_is_superhost,
data = listings_train)
# We construct a table with the regression output
huxtable:: huxreg(list("Model5" = model5),
statistics = c('#observations' = 'nobs',
'R squared' = 'r.squared',
'Adj. R Squared' = 'adj.r.squared',
'Residual SE' = 'sigma',
'P-Value' = 'p.value')
)| Model5 | |
|---|---|
| (Intercept) | 4.926 *** |
| (0.041) | |
| number_of_reviews | 0.000 |
| (0.000) | |
| review_scores_rating | 0.023 ** |
| (0.007) | |
| room_typeHotel room | 0.712 *** |
| (0.053) | |
| room_typePrivate room | -0.544 *** |
| (0.011) | |
| room_typeShared room | -0.834 *** |
| (0.051) | |
| bedrooms | 0.083 *** |
| (0.013) | |
| bathrooms_num | 0.150 *** |
| (0.018) | |
| accommodates | 0.135 *** |
| (0.005) | |
| host_is_superhostTRUE | 0.080 *** |
| (0.014) | |
| #observations | 6942 |
| R squared | 0.512 |
| Adj. R Squared | 0.512 |
| Residual SE | 0.421 |
| P-Value | 0.000 |
| *** p < 0.001; ** p < 0.01; * p < 0.05. | |
car::vif(model5) GVIF Df GVIF^(1/(2*Df))
number_of_reviews 1.200211 1 1.095542
review_scores_rating 1.022280 1 1.011079
room_type 1.314144 3 1.046583
bedrooms 1.893906 1 1.376193
bathrooms_num 1.118330 1 1.057511
accommodates 2.033802 1 1.426114
host_is_superhost 1.200927 1 1.095868
Model 5 is statistically significant, with a p-value of 0. Additionally, Model 5 has a slightly better predictive power than Model 4, although the difference in adj. r-squared is very small (0.3%). The new variable we added, host_is_superhost, is statistically significant, but now the number_of_reviews is not. One potential explanation is that the the number of reviews has different effect on the price depending on whether the host is a superhost or not. To deal with this, we will create an interaction term between host_is_superhost and number_of_reviews:
model5_inter <- lm(log_price_4_nights ~ number_of_reviews +
review_scores_rating +
room_type +
bedrooms +
bathrooms_num+
accommodates+
host_is_superhost+
host_is_superhost*number_of_reviews,
data = listings_train)
# We construct a table with the regression output
huxtable:: huxreg(list("Superhost Model + Interaction" = model5_inter),
statistics = c('#observations' = 'nobs',
'R squared' = 'r.squared',
'Adj. R Squared' = 'adj.r.squared',
'Residual SE' = 'sigma',
'P-Value' = 'p.value')
)| Superhost Model + Interaction | |
|---|---|
| (Intercept) | 4.928 *** |
| (0.041) | |
| number_of_reviews | 0.000 ** |
| (0.000) | |
| review_scores_rating | 0.022 ** |
| (0.007) | |
| room_typeHotel room | 0.707 *** |
| (0.053) | |
| room_typePrivate room | -0.544 *** |
| (0.011) | |
| room_typeShared room | -0.840 *** |
| (0.051) | |
| bedrooms | 0.082 *** |
| (0.013) | |
| bathrooms_num | 0.149 *** |
| (0.018) | |
| accommodates | 0.135 *** |
| (0.005) | |
| host_is_superhostTRUE | 0.111 *** |
| (0.017) | |
| number_of_reviews:host_is_superhostTRUE | -0.001 *** |
| (0.000) | |
| #observations | 6942 |
| R squared | 0.513 |
| Adj. R Squared | 0.512 |
| Residual SE | 0.421 |
| P-Value | 0.000 |
| *** p < 0.001; ** p < 0.01; * p < 0.05. | |
car::vif(model5_inter) GVIF Df GVIF^(1/(2*Df))
number_of_reviews 2.310959 1 1.520184
review_scores_rating 1.023361 1 1.011613
room_type 1.317293 3 1.047001
bedrooms 1.894883 1 1.376547
bathrooms_num 1.119087 1 1.057869
accommodates 2.035820 1 1.426822
host_is_superhost 1.775598 1 1.332516
number_of_reviews:host_is_superhost 3.271314 1 1.808677
model5_inter %>%
broom::tidy(conf.int=TRUE)| term | estimate | std.error | statistic | p.value | conf.low | conf.high |
|---|---|---|---|---|---|---|
| (Intercept) | 4.93 | 0.0406 | 121 | 0 | 4.85 | 5.01 |
| number_of_reviews | 0.000357 | 0.00013 | 2.74 | 0.0061 | 0.000102 | 0.000611 |
| review_scores_rating | 0.0223 | 0.0073 | 3.05 | 0.00227 | 0.00798 | 0.0366 |
| room_typeHotel room | 0.707 | 0.0532 | 13.3 | 7.62e-40 | 0.603 | 0.811 |
| room_typePrivate room | -0.544 | 0.0115 | -47.3 | 0 | -0.566 | -0.521 |
| room_typeShared room | -0.84 | 0.0511 | -16.4 | 1.15e-59 | -0.94 | -0.74 |
| bedrooms | 0.0821 | 0.0129 | 6.38 | 1.83e-10 | 0.0569 | 0.107 |
| bathrooms_num | 0.149 | 0.0183 | 8.14 | 4.56e-16 | 0.113 | 0.185 |
| accommodates | 0.135 | 0.00536 | 25.1 | 2.52e-133 | 0.124 | 0.145 |
| host_is_superhostTRUE | 0.111 | 0.0169 | 6.61 | 4.07e-11 | 0.0784 | 0.144 |
| number_of_reviews:host_is_superhostTRUE | -0.000616 | 0.000187 | -3.3 | 0.000961 | -0.000982 | -0.000251 |
First, we see that including the interaction term made number_of_reviews statistically significant again but the new model has approximately the same predictive power as the old one (51.2% adj. R squared). The way to interpret the new coefficient is as follows: when the host is not superhost, each additional review corresponds to approximately 0.036% higher price. When the host is a superhost, each additional review corresponds to 0.026% lower price. Additionally, being a superhost corresponds to 11.2% higher price when controlling for the number of reviews. Thus, all variables are statistically significant and the predictive power of our model is the highest we have achieved so far.
Next, we would like to include the variable neighbourhood_simplified with the following code:
# Renaming Model 5 Interaction to Model 5
model5 <- model5_inter
model6 <- lm(log_price_4_nights ~ number_of_reviews +
review_scores_rating +
room_type +
bedrooms +
bathrooms_num+
accommodates +
host_is_superhost +
neighbourhood_simplified,
data = listings_train)
# We construct a table with the regression output
huxtable:: huxreg(list("Model6" = model6),
statistics = c('#observations' = 'nobs',
'R squared' = 'r.squared',
'Adj. R Squared' = 'adj.r.squared',
'Residual SE' = 'sigma',
'P-Value' = 'p.value')
)| Model6 | |
|---|---|
| (Intercept) | 4.951 *** |
| (0.044) | |
| number_of_reviews | -0.000 |
| (0.000) | |
| review_scores_rating | 0.024 *** |
| (0.007) | |
| room_typeHotel room | 0.700 *** |
| (0.053) | |
| room_typePrivate room | -0.534 *** |
| (0.011) | |
| room_typeShared room | -0.868 *** |
| (0.050) | |
| bedrooms | 0.088 *** |
| (0.013) | |
| bathrooms_num | 0.137 *** |
| (0.018) | |
| accommodates | 0.134 *** |
| (0.005) | |
| host_is_superhostTRUE | 0.088 *** |
| (0.014) | |
| neighbourhood_simplifiedFriedrichshain-Kreuzberg | 0.018 |
| (0.021) | |
| neighbourhood_simplifiedMitte | 0.063 ** |
| (0.021) | |
| neighbourhood_simplifiedNeukölln | -0.125 *** |
| (0.023) | |
| neighbourhood_simplifiedOther | -0.152 *** |
| (0.024) | |
| neighbourhood_simplifiedPankow | 0.003 |
| (0.022) | |
| neighbourhood_simplifiedTempelhof - Schöneberg | -0.060 * |
| (0.027) | |
| #observations | 6942 |
| R squared | 0.527 |
| Adj. R Squared | 0.526 |
| Residual SE | 0.415 |
| P-Value | 0.000 |
| *** p < 0.001; ** p < 0.01; * p < 0.05. | |
car::vif(model6) GVIF Df GVIF^(1/(2*Df))
number_of_reviews 1.209538 1 1.099790
review_scores_rating 1.023609 1 1.011736
room_type 1.338472 3 1.049788
bedrooms 1.899468 1 1.378212
bathrooms_num 1.125743 1 1.061010
accommodates 2.042343 1 1.429106
host_is_superhost 1.208290 1 1.099222
neighbourhood_simplified 1.054832 6 1.004458
model6 %>%
broom::tidy(conf.int=TRUE)| term | estimate | std.error | statistic | p.value | conf.low | conf.high |
|---|---|---|---|---|---|---|
| (Intercept) | 4.95 | 0.0437 | 113 | 0 | 4.86 | 5.04 |
| number_of_reviews | -5.31e-05 | 9.27e-05 | -0.573 | 0.567 | -0.000235 | 0.000129 |
| review_scores_rating | 0.0239 | 0.00719 | 3.32 | 0.000908 | 0.00978 | 0.038 |
| room_typeHotel room | 0.7 | 0.0526 | 13.3 | 5.9e-40 | 0.597 | 0.803 |
| room_typePrivate room | -0.534 | 0.0114 | -47 | 0 | -0.556 | -0.512 |
| room_typeShared room | -0.868 | 0.0504 | -17.2 | 4.01e-65 | -0.967 | -0.769 |
| bedrooms | 0.0884 | 0.0127 | 6.97 | 3.53e-12 | 0.0635 | 0.113 |
| bathrooms_num | 0.137 | 0.0181 | 7.6 | 3.27e-14 | 0.102 | 0.173 |
| accommodates | 0.134 | 0.00529 | 25.4 | 7.95e-136 | 0.124 | 0.145 |
| host_is_superhostTRUE | 0.0878 | 0.0137 | 6.41 | 1.59e-10 | 0.0609 | 0.115 |
| neighbourhood_simplifiedFriedrichshain-Kreuzberg | 0.018 | 0.0208 | 0.861 | 0.389 | -0.0229 | 0.0588 |
| neighbourhood_simplifiedMitte | 0.063 | 0.0212 | 2.97 | 0.00298 | 0.0214 | 0.105 |
| neighbourhood_simplifiedNeukölln | -0.125 | 0.0226 | -5.51 | 3.69e-08 | -0.169 | -0.0804 |
| neighbourhood_simplifiedOther | -0.152 | 0.0235 | -6.46 | 1.11e-10 | -0.198 | -0.106 |
| neighbourhood_simplifiedPankow | 0.00261 | 0.0219 | 0.119 | 0.905 | -0.0404 | 0.0456 |
| neighbourhood_simplifiedTempelhof - Schöneberg | -0.0598 | 0.0272 | -2.2 | 0.0281 | -0.113 | -0.00643 |
Model 6 is statistically significant and has the highest adjusted r-squared so far at 52.6%. Additionally, 52.7% of the variation in log_price_4_nights can be explained by Model 6. The variable neighbourhood_simplified has 6 possible values: Friedrichshain-Kreuzberg, Mitte, Pankow, Neukölln, Charlottenburg-Wilm, Tempelhof - Schöneberg, and Other. Since neighbourhood_simplified is a categorical variable, only 5 possible outcomes have their own coefficient, and the last one,Charlottenburg-Wilm , is incorporated into the intercept. Although not all neighbourhoods have a statistically significant effect on log price, properties in Mitte are on average 6.5% more expensive, where as properties in Neukolln, Pankow, and Tempelhof are 11.7%, 0.3%, and 5.8% cheaper than Charlottenburg-Wilm respectively.
Since some neighbourhoods have a significant effect on log price and this model has the highest predictive power so far, we will proceed with Model 6.
Next, we would like to add availability_30 or reviews_per_month to our analysis with the following code:
model7 <- lm(log_price_4_nights ~ number_of_reviews +
review_scores_rating +
room_type +
bedrooms +
bathrooms_num +
accommodates+
host_is_superhost+
host_is_superhost*number_of_reviews+
neighbourhood_simplified+
availability_30+
reviews_per_month,
data = listings_train)
# We construct a table with the regression output
huxtable:: huxreg(list("Model 7" = model7),
statistics = c('#observations' = 'nobs',
'R squared' = 'r.squared',
'Adj. R Squared' = 'adj.r.squared',
'Residual SE' = 'sigma',
'P-Value' = 'p.value')
)| Model 7 | |
|---|---|
| (Intercept) | 4.821 *** |
| (0.041) | |
| number_of_reviews | -0.000 ** |
| (0.000) | |
| review_scores_rating | 0.038 *** |
| (0.007) | |
| room_typeHotel room | 0.416 *** |
| (0.050) | |
| room_typePrivate room | -0.542 *** |
| (0.011) | |
| room_typeShared room | -1.085 *** |
| (0.047) | |
| bedrooms | 0.108 *** |
| (0.012) | |
| bathrooms_num | 0.132 *** |
| (0.017) | |
| accommodates | 0.112 *** |
| (0.005) | |
| host_is_superhostTRUE | 0.094 *** |
| (0.016) | |
| neighbourhood_simplifiedFriedrichshain-Kreuzberg | 0.076 *** |
| (0.019) | |
| neighbourhood_simplifiedMitte | 0.091 *** |
| (0.020) | |
| neighbourhood_simplifiedNeukölln | -0.060 ** |
| (0.021) | |
| neighbourhood_simplifiedOther | -0.156 *** |
| (0.022) | |
| neighbourhood_simplifiedPankow | 0.050 * |
| (0.020) | |
| neighbourhood_simplifiedTempelhof - Schöneberg | -0.032 |
| (0.025) | |
| availability_30 | 0.023 *** |
| (0.001) | |
| reviews_per_month | 0.010 * |
| (0.004) | |
| number_of_reviews:host_is_superhostTRUE | -0.000 |
| (0.000) | |
| #observations | 6942 |
| R squared | 0.593 |
| Adj. R Squared | 0.592 |
| Residual SE | 0.385 |
| P-Value | 0.000 |
| *** p < 0.001; ** p < 0.01; * p < 0.05. | |
car::vif(model7) GVIF Df GVIF^(1/(2*Df))
number_of_reviews 2.866633 1 1.693113
review_scores_rating 1.032123 1 1.015935
room_type 1.416314 3 1.059725
bedrooms 1.907454 1 1.381106
bathrooms_num 1.126554 1 1.061393
accommodates 2.078562 1 1.441722
host_is_superhost 1.799955 1 1.341624
neighbourhood_simplified 1.082130 6 1.006599
availability_30 1.180472 1 1.086495
reviews_per_month 1.628604 1 1.276168
number_of_reviews:host_is_superhost 3.299141 1 1.816354
model7 %>%
broom::tidy(conf.int=TRUE)| term | estimate | std.error | statistic | p.value | conf.low | conf.high |
|---|---|---|---|---|---|---|
| (Intercept) | 4.82 | 0.0407 | 118 | 0 | 4.74 | 4.9 |
| number_of_reviews | -0.000402 | 0.000132 | -3.04 | 0.00241 | -0.000662 | -0.000142 |
| review_scores_rating | 0.0382 | 0.0067 | 5.7 | 1.28e-08 | 0.025 | 0.0513 |
| room_typeHotel room | 0.416 | 0.0495 | 8.4 | 5.22e-17 | 0.319 | 0.513 |
| room_typePrivate room | -0.542 | 0.0106 | -51.3 | 0 | -0.563 | -0.522 |
| room_typeShared room | -1.09 | 0.0473 | -22.9 | 3.21e-112 | -1.18 | -0.992 |
| bedrooms | 0.108 | 0.0118 | 9.12 | 9.43e-20 | 0.0845 | 0.131 |
| bathrooms_num | 0.132 | 0.0168 | 7.89 | 3.55e-15 | 0.0994 | 0.165 |
| accommodates | 0.112 | 0.00495 | 22.7 | 2.41e-110 | 0.103 | 0.122 |
| host_is_superhostTRUE | 0.0945 | 0.0155 | 6.09 | 1.2e-09 | 0.0641 | 0.125 |
| neighbourhood_simplifiedFriedrichshain-Kreuzberg | 0.0762 | 0.0194 | 3.92 | 8.81e-05 | 0.0381 | 0.114 |
| neighbourhood_simplifiedMitte | 0.0908 | 0.0197 | 4.61 | 4.14e-06 | 0.0521 | 0.129 |
| neighbourhood_simplifiedNeukölln | -0.0599 | 0.0211 | -2.84 | 0.00457 | -0.101 | -0.0185 |
| neighbourhood_simplifiedOther | -0.156 | 0.0218 | -7.13 | 1.1e-12 | -0.198 | -0.113 |
| neighbourhood_simplifiedPankow | 0.05 | 0.0204 | 2.45 | 0.0143 | 0.00997 | 0.0899 |
| neighbourhood_simplifiedTempelhof - Schöneberg | -0.0316 | 0.0253 | -1.25 | 0.212 | -0.0811 | 0.018 |
| availability_30 | 0.0225 | 0.0007 | 32.1 | 2.13e-211 | 0.0211 | 0.0239 |
| reviews_per_month | 0.00982 | 0.00403 | 2.44 | 0.0148 | 0.00192 | 0.0177 |
| number_of_reviews:host_is_superhostTRUE | -0.000213 | 0.000171 | -1.24 | 0.215 | -0.000548 | 0.000123 |
Model 7 is statistically significant and has the highest predictive power so far at 59.2% adjusted r-squared. Additionally, 59.3% of the variation in log_price_4_nights is explained by the model. Both availability_30 and reviews_per_month are statistically significant. Look at the VIF test, we see that there is no problem with multicollinearity, which would suggest we should leave both variables in. Looking at number_of_reviews, we see that while the variable’s coefficient was insignificant in Model 6, it is significant now. In other words, controlling for the reviews per month a property receives makes the number of reviews a property has significant. This means we should include reviews_per_month in our model, despite it being not statistically significant.
Following this, we would like to increase the predictive power of our model as much as possible. To do that we will start testing the predictive power of the following variables one by one:
model8 <- lm(log_price_4_nights ~ number_of_reviews +
review_scores_rating +
room_type +
bedrooms +
bathrooms_num+
accommodates +
host_is_superhost +
neighbourhood_simplified +
availability_30 +
reviews_per_month +
review_scores_value +
review_scores_cleanliness +
review_scores_location +
review_scores_checkin +
review_scores_accuracy +
latitude +
longitude +
number_of_reviews_l30d +
calculated_host_listings_count,
data = listings_train)
# We construct a table with the regression output
huxtable:: huxreg(list("Model8" = model8),
statistics = c('#observations' = 'nobs',
'R squared' = 'r.squared',
'Adj. R Squared' = 'adj.r.squared',
'Residual SE' = 'sigma',
'P-Value' = 'p.value')
)| Model8 | |
|---|---|
| (Intercept) | 71.062 *** |
| (11.424) | |
| number_of_reviews | -0.001 *** |
| (0.000) | |
| review_scores_rating | 0.175 *** |
| (0.022) | |
| room_typeHotel room | 0.370 *** |
| (0.048) | |
| room_typePrivate room | -0.526 *** |
| (0.010) | |
| room_typeShared room | -1.149 *** |
| (0.050) | |
| bedrooms | 0.112 *** |
| (0.012) | |
| bathrooms_num | 0.124 *** |
| (0.016) | |
| accommodates | 0.108 *** |
| (0.005) | |
| host_is_superhostTRUE | 0.066 *** |
| (0.012) | |
| neighbourhood_simplifiedFriedrichshain-Kreuzberg | 0.101 *** |
| (0.023) | |
| neighbourhood_simplifiedMitte | 0.138 *** |
| (0.022) | |
| neighbourhood_simplifiedNeukölln | -0.050 * |
| (0.024) | |
| neighbourhood_simplifiedOther | -0.100 *** |
| (0.025) | |
| neighbourhood_simplifiedPankow | 0.130 *** |
| (0.025) | |
| neighbourhood_simplifiedTempelhof - Schöneberg | -0.046 |
| (0.026) | |
| availability_30 | 0.020 *** |
| (0.001) | |
| reviews_per_month | 0.024 *** |
| (0.005) | |
| review_scores_value | -0.247 *** |
| (0.019) | |
| review_scores_cleanliness | 0.123 *** |
| (0.013) | |
| review_scores_location | 0.181 *** |
| (0.016) | |
| review_scores_checkin | -0.034 * |
| (0.017) | |
| review_scores_accuracy | -0.055 ** |
| (0.020) | |
| latitude | -1.195 *** |
| (0.212) | |
| longitude | -0.298 ** |
| (0.095) | |
| number_of_reviews_l30d | -0.019 *** |
| (0.004) | |
| calculated_host_listings_count | 0.006 *** |
| (0.001) | |
| #observations | 6823 |
| R squared | 0.620 |
| Adj. R Squared | 0.618 |
| Residual SE | 0.372 |
| P-Value | 0.000 |
| *** p < 0.001; ** p < 0.01; * p < 0.05. | |
car::vif(model8) GVIF Df GVIF^(1/(2*Df))
number_of_reviews 1.698150 1 1.303131
review_scores_rating 3.330309 1 1.824914
room_type 1.685812 3 1.090942
bedrooms 1.938326 1 1.392238
bathrooms_num 1.131639 1 1.063785
accommodates 2.153714 1 1.467554
host_is_superhost 1.231920 1 1.109919
neighbourhood_simplified 4.192234 6 1.126861
availability_30 1.320644 1 1.149193
reviews_per_month 2.637634 1 1.624079
review_scores_value 2.984940 1 1.727698
review_scores_cleanliness 1.988977 1 1.410311
review_scores_location 1.579065 1 1.256609
review_scores_checkin 1.727042 1 1.314170
review_scores_accuracy 2.624735 1 1.620104
latitude 2.261321 1 1.503769
longitude 1.733604 1 1.316664
number_of_reviews_l30d 2.107147 1 1.451601
calculated_host_listings_count 1.434748 1 1.197810
This model gives us the highest adjusted R-squared so far (61.8%) and explains 62.0% of the variation in the log price. In addition, all the coefficients except for neighbourhood_simplifiedTempelhof - Schöneberg are statistically significant at the 5% level. However, we will not remove this as this forms part of a categorical variable whose other coefficients are significant.
Despite this, the addition of all these new variables has made the model complicated. In our next model, we observe that when retaining significant and relevant variables, with only 7 variables, we are able to reduce number of explanatory variables by 6 and only drop explanatory power by 4.4 %, which leads to a less complicated and superior model from a complexity vs. explanatory power-tradeoff while avoiding overfitting the model to the training data.
model9 <- lm(log_price_4_nights ~
room_type +
accommodates +
bathrooms_num+
host_is_superhost +
availability_30 +
review_scores_value +
review_scores_cleanliness +
review_scores_location,
data = listings_train)
# We construct a table with the regression output
huxtable:: huxreg(list("Model9" = model9),
statistics = c('#observations' = 'nobs',
'R squared' = 'r.squared',
'Adj. R Squared' = 'adj.r.squared',
'Residual SE' = 'sigma',
'P-Value' = 'p.value')
)| Model9 | |
|---|---|
| (Intercept) | 4.321 *** |
| (0.069) | |
| room_typeHotel room | 0.397 *** |
| (0.049) | |
| room_typePrivate room | -0.536 *** |
| (0.010) | |
| room_typeShared room | -1.067 *** |
| (0.047) | |
| accommodates | 0.135 *** |
| (0.004) | |
| bathrooms_num | 0.181 *** |
| (0.016) | |
| host_is_superhostTRUE | 0.041 *** |
| (0.011) | |
| availability_30 | 0.021 *** |
| (0.001) | |
| review_scores_value | -0.222 *** |
| (0.015) | |
| review_scores_cleanliness | 0.155 *** |
| (0.012) | |
| review_scores_location | 0.217 *** |
| (0.015) | |
| #observations | 7396 |
| R squared | 0.582 |
| Adj. R Squared | 0.582 |
| Residual SE | 0.385 |
| P-Value | 0.000 |
| *** p < 0.001; ** p < 0.01; * p < 0.05. | |
car::vif(model9) GVIF Df GVIF^(1/(2*Df))
room_type 1.320472 3 1.047422
accommodates 1.318854 1 1.148413
bathrooms_num 1.064827 1 1.031905
host_is_superhost 1.046557 1 1.023014
availability_30 1.116529 1 1.056659
review_scores_value 1.952214 1 1.397216
review_scores_cleanliness 1.646925 1 1.283326
review_scores_location 1.390104 1 1.179027
We believe that Model 9, our latest model, strikes a good balance between simplicity and explanatory power and the variables in the model reflect our intuition about predictors of price.
Now, we check the residuals of our models.
autoplot(model1)autoplot(model2)autoplot(model3)autoplot(model4)autoplot(model5)autoplot(model6)autoplot(model7)autoplot(model8)autoplot(model9)huxtable:: huxreg(list(model1,model2, model3, model4, model5, model6, model7, model8, model9),
statistics = c('#observations' = 'nobs',
'R squared' = 'r.squared',
'Adj. R Squared' = 'adj.r.squared',
'Residual SE' = 'sigma',
'P-Value' = 'p.value')
)| (1) | (2) | (3) | (4) | (5) | (6) | (7) | (8) | (9) | |
|---|---|---|---|---|---|---|---|---|---|
| (Intercept) | 5.677 *** | 5.713 *** | 4.914 *** | 4.907 *** | 4.928 *** | 4.951 *** | 4.821 *** | 71.062 *** | 4.321 *** |
| (0.049) | (0.047) | (0.041) | (0.041) | (0.041) | (0.044) | (0.041) | (11.424) | (0.069) | |
| prop_type_simplifiedEntire rental unit | -0.132 *** | -0.133 *** | |||||||
| (0.034) | (0.033) | ||||||||
| prop_type_simplifiedEntire serviced apartment | 0.461 *** | 0.459 *** | |||||||
| (0.053) | (0.051) | ||||||||
| prop_type_simplifiedOther | -0.222 *** | 0.074 | |||||||
| (0.037) | (0.041) | ||||||||
| prop_type_simplifiedPrivate room in rental unit | -0.844 *** | -0.310 *** | |||||||
| (0.035) | (0.048) | ||||||||
| number_of_reviews | 0.001 *** | 0.001 *** | 0.000 ** | 0.000 ** | 0.000 ** | -0.000 | -0.000 ** | -0.001 *** | |
| (0.000) | (0.000) | (0.000) | (0.000) | (0.000) | (0.000) | (0.000) | (0.000) | ||
| review_scores_rating | 0.036 *** | 0.029 *** | 0.029 *** | 0.028 *** | 0.022 ** | 0.024 *** | 0.038 *** | 0.175 *** | |
| (0.008) | (0.007) | (0.007) | (0.007) | (0.007) | (0.007) | (0.007) | (0.022) | ||
| room_typeHotel room | 0.440 *** | 0.717 *** | 0.719 *** | 0.707 *** | 0.700 *** | 0.416 *** | 0.370 *** | 0.397 *** | |
| (0.061) | (0.053) | (0.053) | (0.053) | (0.053) | (0.050) | (0.048) | (0.049) | ||
| room_typePrivate room | -0.535 *** | -0.544 *** | -0.543 *** | -0.544 *** | -0.534 *** | -0.542 *** | -0.526 *** | -0.536 *** | |
| (0.034) | (0.012) | (0.012) | (0.011) | (0.011) | (0.011) | (0.010) | (0.010) | ||
| room_typeShared room | -0.906 *** | -0.891 *** | -0.853 *** | -0.840 *** | -0.868 *** | -1.085 *** | -1.149 *** | -1.067 *** | |
| (0.060) | (0.053) | (0.051) | (0.051) | (0.050) | (0.047) | (0.050) | (0.047) | ||
| bedrooms | 0.076 *** | 0.081 *** | 0.082 *** | 0.088 *** | 0.108 *** | 0.112 *** | |||
| (0.013) | (0.013) | (0.013) | (0.013) | (0.012) | (0.012) | ||||
| bathrooms_num | 0.151 *** | 0.154 *** | 0.149 *** | 0.137 *** | 0.132 *** | 0.124 *** | 0.181 *** | ||
| (0.018) | (0.018) | (0.018) | (0.018) | (0.017) | (0.016) | (0.016) | |||
| beds | 0.016 * | ||||||||
| (0.007) | |||||||||
| accommodates | 0.127 *** | 0.136 *** | 0.135 *** | 0.134 *** | 0.112 *** | 0.108 *** | 0.135 *** | ||
| (0.007) | (0.005) | (0.005) | (0.005) | (0.005) | (0.005) | (0.004) | |||
| host_is_superhostTRUE | 0.111 *** | 0.088 *** | 0.094 *** | 0.066 *** | 0.041 *** | ||||
| (0.017) | (0.014) | (0.016) | (0.012) | (0.011) | |||||
| number_of_reviews:host_is_superhostTRUE | -0.001 *** | -0.000 | |||||||
| (0.000) | (0.000) | ||||||||
| neighbourhood_simplifiedFriedrichshain-Kreuzberg | 0.018 | 0.076 *** | 0.101 *** | ||||||
| (0.021) | (0.019) | (0.023) | |||||||
| neighbourhood_simplifiedMitte | 0.063 ** | 0.091 *** | 0.138 *** | ||||||
| (0.021) | (0.020) | (0.022) | |||||||
| neighbourhood_simplifiedNeukölln | -0.125 *** | -0.060 ** | -0.050 * | ||||||
| (0.023) | (0.021) | (0.024) | |||||||
| neighbourhood_simplifiedOther | -0.152 *** | -0.156 *** | -0.100 *** | ||||||
| (0.024) | (0.022) | (0.025) | |||||||
| neighbourhood_simplifiedPankow | 0.003 | 0.050 * | 0.130 *** | ||||||
| (0.022) | (0.020) | (0.025) | |||||||
| neighbourhood_simplifiedTempelhof - Schöneberg | -0.060 * | -0.032 | -0.046 | ||||||
| (0.027) | (0.025) | (0.026) | |||||||
| availability_30 | 0.023 *** | 0.020 *** | 0.021 *** | ||||||
| (0.001) | (0.001) | (0.001) | |||||||
| reviews_per_month | 0.010 * | 0.024 *** | |||||||
| (0.004) | (0.005) | ||||||||
| review_scores_value | -0.247 *** | -0.222 *** | |||||||
| (0.019) | (0.015) | ||||||||
| review_scores_cleanliness | 0.123 *** | 0.155 *** | |||||||
| (0.013) | (0.012) | ||||||||
| review_scores_location | 0.181 *** | 0.217 *** | |||||||
| (0.016) | (0.015) | ||||||||
| review_scores_checkin | -0.034 * | ||||||||
| (0.017) | |||||||||
| review_scores_accuracy | -0.055 ** | ||||||||
| (0.020) | |||||||||
| latitude | -1.195 *** | ||||||||
| (0.212) | |||||||||
| longitude | -0.298 ** | ||||||||
| (0.095) | |||||||||
| number_of_reviews_l30d | -0.019 *** | ||||||||
| (0.004) | |||||||||
| calculated_host_listings_count | 0.006 *** | ||||||||
| (0.001) | |||||||||
| #observations | 7562 | 7562 | 6903 | 6947 | 6942 | 6942 | 6942 | 6823 | 7396 |
| R squared | 0.369 | 0.412 | 0.509 | 0.510 | 0.513 | 0.527 | 0.593 | 0.620 | 0.582 |
| Adj. R Squared | 0.368 | 0.411 | 0.509 | 0.509 | 0.512 | 0.526 | 0.592 | 0.618 | 0.582 |
| Residual SE | 0.473 | 0.457 | 0.422 | 0.422 | 0.421 | 0.415 | 0.385 | 0.372 | 0.385 |
| P-Value | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 |
| *** p < 0.001; ** p < 0.01; * p < 0.05. | |||||||||
We now take Model 9 RMSE on the out-of-sample data. We also calculate the RMSE on the in-sample data as a benchmark.
rmse_train<-listings_train %>%
mutate(predictions = predict(model9,.)) %>%
select(predictions,log_price_4_nights) %>%
mutate(squared_error = (predictions - log_price_4_nights)^2) %>%
summarise(rmse = sqrt(mean(squared_error, na.rm=TRUE))) %>%
pull
rmse_train[1] 0.384718
rmse_test <- listings_test %>%
mutate(predictions = predict(model9,.)) %>%
select(predictions,log_price_4_nights) %>%
mutate(squared_error = (predictions - log_price_4_nights)^2) %>%
summarise(rmse = sqrt(mean(squared_error, na.rm=TRUE))) %>%
pull()
rmse_test[1] 0.3792494
The RMSEs are similar for the in-sample and out-of-sample data, so we can see that Model 9 does not over fit.
Finally, we calculate the predictions of price using Model 9 for apartments with a private room, have at least 10 reviews, and which have an average rating of at least 4.5
subset<-listings_test %>%
filter(
room_type=="Private room",
number_of_reviews>=10,
review_scores_rating>=4.5
)
predicted_logprices<-model9 %>%
predict(subset) %>%
as.data.frame()
favstats(~., data=predicted_logprices) %>%
mutate(lower=mean-1.96*sd/sqrt(n),
upper=mean+1.96*sd/sqrt(n),
lower_nominal=exp(lower), #Here we convert from log prices to prices.
upper_nominal=exp(upper),
mean_nominal=exp(mean),
min_nominal=exp(min),
max_nominal=exp(max)) %>%
select(min_nominal,
mean_nominal,
max_nominal,
lower_nominal,
upper_nominal
)| min_nominal | mean_nominal | max_nominal | lower_nominal | upper_nominal |
|---|---|---|---|---|
| 117 | 163 | 414 | 159 | 166 |